Beyond p values and significance
"Accepting the null hypothesis"
Power
Utility of a result
Showing that things are NOT different
Example: Oates and Heeringa wanted to show that their grammar induction algorithm performed "the same" as the inside/outside algorithm.
Approaches:
Confidence interval around the difference
Power analysis
Showing that the proportion of variance due to algorithm is smaller than the proportion due to problem (analysis of variance)
Showing that the difference, though significant, is meaningless
"Accepting the null hypothesis" Sometimes you want to show A and B are not different Hypothesis testing doesn't allow that! Ok, can we say A and B are equal if we cannot reject Ho: A = B? = (A) (B) s.e. (A ) (B) = (A) (B) ˆ (A) (B) N
So, what can we do to "accept the null hypothesis"?
[Figure: two sampling distributions of x̄_A - x̄_B]
If these are sampling distributions of x̄_A - x̄_B, which makes you more confident that µ_A - µ_B = 0?
Example: Animal Watch, total math problems, male/female
[Figure: histograms of total math problems (scores 1-9) for males and females]
"Accept" the hypothesis that male and female scores are equal?
t = (4.23 - 3.63) / se(x̄_male - x̄_female) = .6/.4 = 1.5, not significant.
Confidence interval around the difference:
(x̄_male - x̄_female) - 1.96 se ≤ µ_male - µ_female ≤ (x̄_male - x̄_female) + 1.96 se
0.6 - 1.96(.4) ≤ µ_male - µ_female ≤ 0.6 + 1.96(.4)
-0.184 ≤ µ_male - µ_female ≤ 1.38
Bootstrap distribution of the difference between male and female scores. Confidence interval: [-0.184, 1.39].

(defun two-sample-bootstrap (sample1 sample2 statistic k)
  ;; Resample each group with replacement K times; collect the statistic
  ;; computed on each pair of resamples.
  (let* ((n1 (length sample1)) (n2 (length sample2))
         (s1* (make-array n1)) (s2* (make-array n2))
         (dist nil))
    (dotimes (i k)
      (dotimes (j n1) (setf (aref s1* j) (nth (random n1) sample1)))
      (dotimes (j n2) (setf (aref s2* j) (nth (random n2) sample2)))
      (push (funcall statistic s1* s2*) dist))
    (values dist)))

(two-sample-bootstrap m f #'(lambda (x y) (- (mean x) (mean y))) 500)
;; STATISTIC is called on the resampled vectors S1* and S2*, so MEAN must
;; accept vectors as well as lists.

[Figure: histogram of the bootstrapped differences]
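One way (a sketch, not part of the original code) to turn the bootstrapped distribution into a 95% confidence interval is the percentile method: sort the k bootstrapped differences and read off the 2.5th and 97.5th percentiles.

(defun bootstrap-percentile-ci (dist &optional (alpha .05))
  ;; Percentile confidence interval from a list of bootstrapped statistics.
  (let* ((sorted (sort (copy-list dist) #'<))
         (n (length sorted))
         (lo (floor (* (/ alpha 2) n)))
         (hi (1- (ceiling (* (- 1 (/ alpha 2)) n)))))
    (list (nth lo sorted) (nth hi sorted))))

;; e.g. (bootstrap-percentile-ci
;;        (two-sample-bootstrap m f #'(lambda (x y) (- (mean x) (mean y))) 500))
;; should come out near (-0.184 1.39) for the male/female data.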
Errors and Power
Type I error: rejecting H0 when H0 is true.
Type II error: failing to reject H0 when H0 is false.
Power: 1 - Pr(Type II error) = Pr(rejecting H0 when H0 is false).
[Figure: overlapping H0 and H1 sampling distributions, marking Pr(Type I error), power, and the critical value to reject H0 at, say, α = .05]
Power and H1
Power can be assessed only with respect to H1. You must specify H1 before you can calculate the power of a test.
[Figure: the same H0 and H1 distributions, with the critical value to reject H0 at, say, α = .05]
Example: What is the power of a t test to find a difference of at least .5 between the means of males and females?
H0: µ_males - µ_females = 0
H1: µ_males - µ_females = 0.5
[Figure: male and female score histograms (1-9) and the H0 and H1 sampling distributions]
Example: What is the power of a t test to find a difference of at least .5 between the means of males and females?
H0: µ_males - µ_females = 0
H1: µ_males - µ_females = 0.5
From earlier slides we know the standard error of the difference between the means is 0.4, so the one-tailed critical value is 1.645 × .4 = .658.
Assuming the H1 sampling distribution has the same form, .658 is (.658 - .5)/.4 = .158/.4 = .395 standard error units away from the mean of the H1 distribution.
34.6% of a normal curve lies beyond .395 standard deviations from the mean, so the power of the test to detect a difference of .5 is .346.
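The same calculation as a Common Lisp sketch (not from the slides). The normal CDF below uses the Abramowitz and Stegun erf approximation; delta is the difference assumed under H1 and se is the standard error of the difference.

(defun normal-cdf (z)
  ;; Standard normal CDF via the Abramowitz & Stegun 7.1.26 approximation
  ;; to erf, reflected for negative z.
  (let* ((x (/ (abs z) (sqrt 2.0d0)))
         (tt (/ 1.0d0 (+ 1.0d0 (* 0.3275911d0 x))))
         (erf (- 1.0d0
                 (* (exp (- (* x x)))
                    (+ (* 0.254829592d0 tt)
                       (* -0.284496736d0 (expt tt 2))
                       (* 1.421413741d0 (expt tt 3))
                       (* -1.453152027d0 (expt tt 4))
                       (* 1.061405429d0 (expt tt 5)))))))
    (if (minusp z)
        (/ (- 1.0d0 erf) 2)
        (/ (+ 1.0d0 erf) 2))))

(defun one-tailed-power (delta se &optional (z-crit 1.645))
  ;; Pr(reject H0 | true difference = DELTA) for a one-tailed test.
  (- 1.0 (normal-cdf (- z-crit (/ delta se)))))

;; (one-tailed-power 0.5 0.4)  =>  approximately 0.346, as on the slide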
Which factors in a test affect power?
[Figure: the H0 and H1 distributions from the previous example, with the critical value .658 and H1 mean .5]
Which factors in a test affect power? The standard error (variance and sample size), the effect size, and alpha.
Power curves: fix three of the factors, vary one.
[Figure: power as a function of the mean under H1, with H0/H1 distribution insets at H1 means of 0.5 and 0.75]
Power curves
For normal sampling distributions, $\mathrm{Crit}_{.05} = 1.645\sqrt{\hat{\sigma}_1^2/N + \hat{\sigma}_2^2/N}$.
As N increases, Crit.05 decreases and power increases.
[Figure: Crit.05 for the male/female test data as a function of N (50-300), assuming the variances for males and females remain constant]
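A small sketch of the critical value as a function of N, under the formula as reconstructed above (equal group sizes, fixed group variances). The variance arguments are whatever was estimated from the male and female samples; the names below are only illustrative.

(defun crit-05 (var1 var2 n)
  ;; One-tailed .05 critical value for the difference between two means,
  ;; N cases per group, group variances held fixed.
  (* 1.645 (sqrt (+ (/ var1 n) (/ var2 n)))))

;; Evaluating (crit-05 v-males v-females n) for n from 50 to 300 traces the
;; falling curve in the figure: larger N, smaller critical value, more power.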
Yes, there's a difference, but does it mean anything?
People make a big deal over differences in mathematics scores between boys and girls. These differences are tiny compared with those between American and Japanese students.
The difference between KOSO and KOSO* raw runtimes is tiny compared with the random effect of the problem on which they are tested.
Significant and meaningful are not synonymous
In the RKF summer trials, knowledge engineers (KEs) got significantly higher scores than naïve users (SMEs) (p < .0001). How much predictive power does this knowledge afford?
Suppose you wanted to predict whether a score was higher or lower than the median of all scores. How much would it help to know whether the score belonged to a KE or an SME?
[Figure: score histograms for SMEs (N = 277, 161 values < 2.59), KEs+SMEs (N = 417, median = 2.59), and KEs (N = 140, 101 values > 2.59)]
Guess whether x > 2.59. Error reduction by knowing whether x belongs to an SME or a KE:
No knowledge: 417/2 = 208.5 expected errors if you say x > 2.59.
You know x comes from an SME: guess x < 2.59 and make 277 - 161 = 116 errors.
You know x comes from a KE: guess x > 2.59 and make 140 - 101 = 39 errors.
Error reduction is (208.5 - (116 + 39)) / 208.5 = 25.6%.
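The same arithmetic as a small Common Lisp sketch (not from the slides); the arguments are just the error counts and group sizes read off the histograms.

(defun error-reduction (group-errors group-sizes)
  ;; Proportional reduction in error from knowing group membership:
  ;; guessing blind gets half of all cases wrong; guessing each group's
  ;; majority side of the median gets GROUP-ERRORS wrong.
  (let* ((n (reduce #'+ group-sizes))
         (baseline (/ n 2.0))
         (informed (reduce #'+ group-errors)))
    (/ (- baseline informed) baseline)))

;; (error-reduction '(116 39) '(277 140))  =>  approximately 0.256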
Significant and meaningful are not synonyms
Suppose you wanted to use the knowledge that the ring is controlled by KOSO or KOSO* for some prediction. How much predictive power would this knowledge confer?
Grand median k = 1.11; Pr(trial i has k > 1.11) = .5.
Probability that trial i under KOSO has k > 1.11 is 0.57.
Probability that trial i under KOSO* has k > 1.11 is 0.43.
Predict for trial i whether k > 1.11:
If it's a KOSO* trial you'll say no, with (.43 × 150) = 64.5 errors.
If it's a KOSO trial you'll say yes, with ((1 - .57) × 160) = 68.8 errors.
If you don't know which, you'll make (.5 × 310) = 155 errors.
155 - (64.5 + 68.8) ≈ 22 fewer errors.
Knowing the algorithm reduces the error rate from .5 to .43.
Stay/go decision
An epoch: collect several views of an object and give them a common (but new) label.
The robot has experienced M epochs and is k views into the current epoch. Should it collect more views or go?
Intuition: if additional views cannot help it discriminate the current object from others in memory, it should go.
Model: you have a sample s1 and you are accumulating data into s2. When the data do not improve the discrimination of s1 and s2, stop sampling.
Stay/go math
$\phi = \dfrac{SS_g - (SS_a + SS_b)}{SS_g}$
Theoretical maximum value when $N_b = N_a \cdot \mathrm{std}(a)/\mathrm{std}(b)$
(from Paola Sebastiani)
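A Common Lisp sketch of φ under the formula as reconstructed above (not from the slides): SS is the sum of squared deviations from a sample's mean, and SS_g is computed on the two samples pooled. mean is the same assumed helper as before.

(defun sum-of-squares (sample)
  ;; Sum of squared deviations from the sample mean.
  (let ((m (mean sample)))
    (reduce #'+ (mapcar #'(lambda (x) (expt (- x m) 2)) sample))))

(defun phi (s1 s2)
  ;; (SSg - (SSa + SSb)) / SSg for the pooled and separate samples.
  (let ((ssg (sum-of-squares (append s1 s2))))
    (/ (- ssg (+ (sum-of-squares s1) (sum-of-squares s2))) ssg)))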
Stay/go experiments
[Figure: φ versus number of views (up to 100) for A1 (different) and A2 (similar); φ ranges roughly from 0.01 to 0.06]
Significant and meaningful (or useful) are not synonyms
Suppose you wanted to predict the run-time of a trial. If you don't know Algorithm, your best guess is the grand mean and your uncertainty is the grand variance. If you do know Algorithm, then your uncertainty is less:
$\omega^2 = \dfrac{\sigma^2 - \sigma^2_{Algorithm}}{\sigma^2}$ = reduction in uncertainty due to knowing Algorithm
$\hat{\omega}^2 = \dfrac{t^2 - 1}{t^2 + N_1 + N_2 - 1}$ = estimate of the reduction in variance (recall t = 2.42 from the Rosenberg study)
$\hat{\omega}^2 = \dfrac{2.42^2 - 1}{2.42^2 + 160 + 150 - 1} = .015$
All other things equal, increasing sample size decreases the utility of knowing the group to which a trial belongs.
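The estimate above as a one-off Common Lisp sketch:

(defun omega-squared (t-value n1 n2)
  ;; Estimated reduction in variance from a t value and the two group sizes.
  (/ (- (* t-value t-value) 1)
     (+ (* t-value t-value) n1 n2 -1)))

;; (omega-squared 2.42 160 150)  =>  approximately 0.015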