Hypothesis Testing. Méthodes probabilistes pour le TAL (Probabilistic Methods for NLP)



Guillaume Wisniewski, Université Paris Sud & LIMSI, November 2017

Framework

Goal
Express the hypothesis in a form that can be tested, then check whether the experimental evidence supports it.

Example
Nadi's claim (hypothesis): Nadi's genius is due to having eaten salted butter since childhood.
How do we test this hypothesis, i.e. check that experimental evidence supports it?

(Re)formulating the hypothesis
We have to decide on an operational definition of intelligence: something we can measure, fixed by convention between experimenters. Here: an IQ test.
Research hypothesis: people who eat salted butter are cleverer, i.e. have a higher IQ.

Hypothesis Testing
A procedure (a logical sequence of steps) to decide whether to accept or reject a hypothesis.

How to decide?
Compare the distributions of:
- the IQ of people who eat salted butter (salted-IQ distribution)
- the IQ of people who do not eat salted butter (usual-IQ distribution)

But...
Salted-IQ distribution: something we simply cannot find out, since we only have access to one score!
Usual-IQ distribution: can we give everyone an IQ test? Alternative: assume that IQ scores are normally distributed. The creators of IQ tests deliberately constructed them so that the scores are distributed according to $\mathcal{N}(100, 15)$, i.e. mean 100 and standard deviation 15.


Null Hypothesis

What we have...
One distribution out of two is known, so we cannot test our research hypothesis directly... but we can test the null hypothesis
$H_0$: the usual-IQ and salted-IQ distributions are the same.

Null Hypothesis
Consider $H_0$ innocent until proven guilty: assume $H_0$ is true unless the data give strong evidence to the contrary.

Testing the null hypothesis
Principle: assume that the salted-IQ and usual-IQ distributions are the same, and test whether Nadi's IQ score comes from the usual-IQ distribution.
In practice, compute the z-score:

$$z = \frac{x - \mu}{\sigma} \qquad (1)$$

This is the distance between the raw score and the population mean, in units of the standard deviation.
z-table: area under the Gaussian curve to the right of $z$.

Interpreting z-scores
[Figure: Gaussian curve, frequency against z. The value in the z-table is the probability of observing a value larger than z.]
In our case:
- Nadi's IQ = 120, so z = 1.33. According to the z-table, 9.18% of the usual-IQ distribution have an IQ score higher than Nadi's: not that impressive.
- Nadi's IQ = 145, so z = 3. Only 0.13% of scores are higher than Nadi's: it is more likely that this score belongs to a different, higher, distribution.
In both cases we rely on the assumption that the salted-IQ and usual-IQ distributions are the same.

General interpretation
General principle: hypothesis testing is a gamble on the basis of probabilities. If the probability of the observed score coming from a distribution identical to the usual-IQ distribution is very low, we reject the null hypothesis; if the probability is not very low, we accept it (that is, we fail to reject it). When should we switch from rejection to acceptance?

Significance Level
Rejecting $H_0$ at a significance level of 0.05 means that the score from the unknown distribution could only arise from the known distribution with a chance of less than 5%. The significance level is the decision criterion.
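The z-score lookup for Nadi's two hypothetical IQ scores can be sketched in a few lines. This is only an illustration: the function names `z_score` and `upper_tail` are assumptions of this sketch, and the z-table lookup is replaced by the complementary error function from the standard library.

```python
import math

def z_score(x, mu, sigma):
    """Distance between a raw score and the mean, in standard deviations (eq. 1)."""
    return (x - mu) / sigma

def upper_tail(z):
    """Area under the standard Gaussian curve to the right of z (what a z-table lists)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Usual-IQ distribution assumed to be N(100, 15), as stated in the slides.
for iq in (120, 145):
    z = z_score(iq, mu=100, sigma=15)
    print(f"IQ={iq}: z={z:.2f}, P(score > IQ)={upper_tail(z):.4f}")
```

Using the rounded value z = 1.33, as a printed z-table would, `upper_tail` gives about 9.18%; the unrounded z = 1.333... gives a slightly smaller area.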

Vocabulary: one- and two-tailed predictions
1. The unknown distribution is the same as the known distribution.
2. The unknown distribution is higher up the scale than the known distribution.
3. The unknown distribution is lower down the scale than the known distribution.
A one-tailed prediction names a direction (2 or 3); a two-tailed prediction only states that the distributions differ (either 2 or 3).

Example (i)
We are tossing a coin. Is it fair?
Principle: toss the coin $n$ times. The coin is suspicious if the number of heads is much less or much more than $n/2$.

Example (ii): application
Hypothesis: let $c$ be the probability of observing heads.
$H_0$: $c = 0.5$
$H_A$: $c \neq 0.5$ (alternate/research hypothesis)
Statistical test: let $\hat{c}$ be the proportion of heads in $n$ tosses. The standard deviation of $\hat{c}$ is $\sqrt{c(1-c)/n}$, and the test statistic is:

$$z = \frac{\hat{c} - c}{\sqrt{c(1-c)/n}} \qquad (2)$$

First case: $n = 100$, $\hat{c} = 0.62$. We have $z = 2.4$; value in the z-table: 0.82%. We reject $H_0$ at the 5% level.
Second case: $n = 100$, $\hat{c} = 0.47$. We have $z = -0.6$; value in the z-table: 27.43%. $c$ is not significantly different from 0.5 at the 5% level.

What have we seen?
Simplest case: one known distribution, assumed normal; one sample.
Generalization(s): hypotheses on the usual distribution (shape, $\mu$ known, $\sigma$ known, ...); one sample or two samples.
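The two coin cases from Example (ii) can be reproduced with a short sketch of equation (2). The names `coin_z` and `upper_tail` are assumptions of this sketch; the tail area is the one-sided value a z-table lists, and doubling it gives the two-sided p-value that matches $H_A$: $c \neq 0.5$.

```python
import math

def coin_z(c_hat, n, c0=0.5):
    """Test statistic of eq. (2): (c_hat - c0) / sqrt(c0 * (1 - c0) / n)."""
    return (c_hat - c0) / math.sqrt(c0 * (1 - c0) / n)

def upper_tail(z):
    """Area under the standard Gaussian curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

for c_hat in (0.62, 0.47):
    z = coin_z(c_hat, n=100)
    one_sided = upper_tail(abs(z))   # z-table value for |z|
    two_sided = 2 * one_sided        # p-value for the two-tailed alternative
    verdict = "reject H0" if two_sided < 0.05 else "do not reject H0"
    print(f"c_hat={c_hat}: z={z:.2f}, table={one_sided:.4f}, p={two_sided:.4f} -> {verdict}")
```

In both cases the two-sided p-value leads to the same decision at the 5% level as the one reported in the slides.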


In practice
The spirit is the same, but there are (lots of) technical difficulties, e.g.:
- a Student distribution instead of a normal distribution when the variance is not known
- non-parametric tests

Example: length of sentences
Data:
- Mean sentence length in 50 novels from the 1950s: $\bar{X}_1 = 9.3$
- Mean sentence length in 50 novels from the 2000s: $\bar{X}_2 = 6.4$
- $X$ is normally distributed with variance $\sigma^2 = 34.2$
Test statistic (difference in estimated means):

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} \qquad (3)$$

for which the slides report $Z = 2.28$.
Conclusions: p = [...]. Reject $H_0$ at $\alpha = 5\%$ (but not at $\alpha = 1\%$).

What is wrong with significance testing?

History
- most of the concepts were developed by Sir Ronald Fisher in the 1920s
- a genius who almost single-handedly created the foundations for modern statistical science
- strong opposition from the very beginning
- at the core of most scientific results; founding principles of the design of experiments

It's Not Easy Being Greene (ER, Season 2, ep. 13)
Benton: Are you serious?
Vucelich: Simon did an analysis of our result. Our P-value was [...]. We are one successful outcome away from statistical significance.
Benton: We can publish?
Vucelich: Soon. One more aneurysm and our numbers will blind the most dubious skeptics. After that, we head to D.C. to play dog-and-pony for the FDA. Now, Simon doesn't fly, so he stays here, which makes you the next choice for Clamp-and-Run Ambassador to Europe. ...
Vucelich: You've gotta find another patient soon because the Norwegians are doing a similar study. And, Peter, we cannot let the Vikings pillage our thunder.


Definition of $H_0$
In the coin example, $H_0$: $c = 0.5$. In practice, $c$ will never be exactly 0.5. What is important is that it must be close to 0.5, but that is less tractable. We know a priori that $H_0$ is false.

What is significance?
Textbook case: compare a new drug to an old drug. The new drug works 0.4% (i.e. [...]) better than the old one: is the new one significantly better? What if the new drug has much worse side effects and costs a lot more (a given, for a new drug)?

Impact of sample size
Recall that in the coin example:

$$z = \frac{\hat{c} - c}{\sqrt{c(1-c)/n}} \qquad (4)$$

For any fixed difference $\hat{c} - c \neq 0$, to make $z$ arbitrarily large (and hence the p-value arbitrarily small), just increase $n$. As the sample size increases, eventually everything becomes significant, and in NLP $n$ is always large!

What should we do instead?
Every educated person should understand statistics and hypothesis testing!
Possible solution: form a confidence interval. If $n$ is large enough, the estimation is reasonably accurate, and the location of the interval is the answer to the question.
Example (coins): the confidence interval [0.502, 0.504] is close enough to 0.5 (even if 0.5 is not in it) and very narrow.
But: no automatic decision.

On the importance of automatic decision
"I was in search of a one-armed economist, so that the guy could never make a statement and then say: on the other hand" (President Harry S. Truman)
It is foolish to expect to prove anything in a mathematical sense. Statistics is one piece of evidence that must be weighed and combined with other information: what counts is the preponderance of all the evidence. But this leaves room for lots of discussion.

Evaluating classifiers performance in NLP
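Returning to the coin example, the effect of sample size and the confidence-interval alternative can be sketched as follows. The values `c_hat = 0.503` and `n = 960_400` are hypothetical, chosen for this sketch so that the interval comes out close to the [0.502, 0.504] quoted in the slides; the interval is a plain normal-approximation (Wald) interval, which is an assumption, since the slides do not say how theirs was built.

```python
import math

def coin_z(c_hat, n, c0=0.5):
    """Test statistic of eq. (4) for H0: c = c0."""
    return (c_hat - c0) / math.sqrt(c0 * (1 - c0) / n)

def wald_interval(c_hat, n, z_crit=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    half_width = z_crit * math.sqrt(c_hat * (1 - c_hat) / n)
    return c_hat - half_width, c_hat + half_width

c_hat = 0.503  # hypothetical observed proportion of heads, barely above 0.5
for n in (100, 10_000, 960_400):
    lo, hi = wald_interval(c_hat, n)
    print(f"n={n:>7}: z={coin_z(c_hat, n):5.2f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

The same tiny deviation from 0.5 is far from significant at n = 100 and clearly significant at the largest n, while the interval shows directly how small the deviation is.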


The task
Context: compare a new system A to a baseline system B. Is A better than B on some large population of data? What can we conclude if A beats B on one particular dataset? Some victories happen by chance.
[Figure: accuracies of two PoS taggers across 10 datasets]

Difficulties
Main problem: it is (almost) impossible to draw new test sets from the underlying population.
Effect size: $\delta(x) = s_A(x) - s_B(x)$ (difference of score on dataset $x$)
- $\delta(x)$ is not normally distributed
- $\delta(x)$ does not follow any well-studied distribution
- many biases (e.g. sample size)

In practice: paired bootstrap

Show me the code
1. Draw $b$ bootstrap samples $x^{(i)}$ of size $n$ by sampling with replacement from $x$
2. Initialize $s = 0$
3. For each $x^{(i)}$, increment $s$ if $\delta(x^{(i)}) > 2\,\delta(x)$
4. Estimate $p \approx s / b$
Interpretation: how often does A beat B by more than $\delta(x)$ on $x^{(i)}$? The factor 2 is a mean correction: since $x^{(i)}$ is drawn from $x$, we expect A to beat B by $\delta(x)$ for at least half of the $x^{(i)}$.

[Figure: impact of test set size]

Conclusion

References
Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein, An empirical investigation of statistical significance in NLP, EMNLP (Stroudsburg, PA, USA), Association for Computational Linguistics, 2012.
Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy, and Héctor Martínez Alonso, What's in a p-value in NLP?, CoNLL (Ann Arbor, Michigan), Association for Computational Linguistics, June 2014, pp. 1-10.
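The four paired-bootstrap steps listed under "Show me the code" can be sketched in Python as below. This is a minimal sketch under stated assumptions: the function name `paired_bootstrap` and its parameters are hypothetical, and the score is assumed to be an average of per-example scores (such as accuracy), so that $\delta$ on a resample is the mean of per-example differences. A metric that does not decompose this way (e.g. F1) would have to be recomputed on each resample instead.

```python
import random

def paired_bootstrap(scores_a, scores_b, b=10_000, seed=0):
    """Estimate p as the fraction of resamples where A beats B by more than 2 * delta(x).

    scores_a[i] and scores_b[i] are the per-example scores of systems A and B
    on the same test example i (assumption: e.g. 1 if correct, 0 otherwise).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same examples")
    n = len(scores_a)
    rng = random.Random(seed)
    diffs = [a - c for a, c in zip(scores_a, scores_b)]
    delta = sum(diffs) / n                      # delta(x) on the original test set
    s = 0                                       # step 2
    for _ in range(b):                          # step 1: b resamples of size n
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) / n > 2 * delta:       # step 3
            s += 1
    return delta, s / b                         # step 4

# Hypothetical per-example correctness for two taggers on 200 tokens.
tagger_a = [1] * 170 + [0] * 30
tagger_b = [1] * 160 + [0] * 40
delta, p = paired_bootstrap(tagger_a, tagger_b, b=2_000)
print(f"delta={delta:.3f}, p={p:.3f}")
```

Resampling the per-example differences, rather than each system's outputs separately, is what keeps the test paired: every resample uses the same examples for both systems.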
