Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006


1 A Brief Review of Hypothesis Testing and Its Uses

$p$-values and pure significance tests (R.A. Fisher) focus on null hypothesis testing. Neyman-Pearson tests focus on null and alternative hypothesis testing. Both involve an appeal to long-run trials. They adopt an ex ante position (justify a procedure by the number of times it is successful if used repeatedly).

2 Pure Significance Tests

Focuses exclusively on the null hypothesis. Let $(y_1, \dots, y_T)$ be observations from a sample. Let $t(y_1, \dots, y_T)$ be a test statistic. If (1) we know the distribution of $t(y)$ under $H_0$, and (2) the larger the value of $t(y)$, the more the evidence against $H_0$, then $p_{obs} = \Pr(t \geq t_{obs} \mid H_0)$. This is evidence against the null hypothesis.

Observe that the $p$-value is a Uniform$(0,1)$ random variable: for a random variable $X$ with a density (absolutely continuous with respect to Lebesgue measure), $F(X)$ is uniform for any distribution function $F$ that is continuous. (Prove this for yourself; it is automatic from the definition.) The $p$-value is the probability that a statistic at least as large as the one observed would occur given that $H_0$ is a true state of affairs. A $t$ test or $F$ test for a regression coefficient is an example; the higher the test statistic, the more likely we are to reject. This ignores any evidence on alternatives.
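To illustrate the uniformity claim, here is a minimal simulation sketch (the one-sided $z$ test for a normal mean with known variance is an assumed example, not taken from the notes): generating data repeatedly under $H_0$, the empirical distribution of the resulting $p$-values should be close to Uniform$(0,1)$.

```python
# Sketch: under H0 the p-value of a continuous test statistic is Uniform(0,1).
# Hypothetical setup: one-sided z test for the mean of N(mu, 1) data, H0: mu = 0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
T, reps = 25, 100_000
x = rng.normal(loc=0.0, scale=1.0, size=(reps, T))   # data generated under H0
z = x.mean(axis=1) * np.sqrt(T)                      # z-statistic; N(0,1) under H0
pvals = norm.sf(z)                                   # one-sided p-value P(Z >= z_obs)

for q in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(q, np.mean(pvals <= q))                    # each frequency should be close to q
```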

R.A. Fisher liked this feature because it did not involve speculation about possibilities other than the one realized. $p$-values make an absolute statement about a model.

Questions I consider: (1) How to conduct a best test? Compare alternative tests. Any monotonic transformation of the statistic produces the same $p$-value. (2) Pure significance tests depend on the sampling rule used. This is not necessarily bad. (3) How to pool across studies (or across coefficients)?

2.1 Bayesian vs. Frequentist or Classical Approach

ISSUES: (1) In what sense, and how well, do significance levels or $p$-values summarize evidence in favor of or against hypotheses? (2) Do we always reject a null in a big enough sample? Meaningful hypothesis testing (Bayesian or Classical) requires that significance levels decrease with sample size. (3) Two views: $\beta = 0$ tests something meaningful vs. $\beta = 0$ is only an approximation and shouldn't be taken too seriously.

(4) How to test non-nested models (the Cox non-nested test and extensions). Classical form: set

$H_0: Y = X_0\beta_0 + U_0$ vs. $H_1: Y = X_1\beta_1 + U_1$.

Test the artificial compound model $H$: $Y = X_0\beta_0 + X_1\beta_1 + U$. Test $H$ vs. $H_0$ (this is asymmetric), and $H$ is not the hypothesis of interest. Theil $\bar R^2$ test (it is asymmetric in stating the null). See Leamer, Chapter 3.

(5) How to quantify evidence about a model? (How to incorporate prior restrictions?) What is the strength of evidence? (6) How to account for model uncertainty: fishing, etc. First consider the basic Neyman-Pearson structure; then switch over to a Bayesian paradigm.

Useful to separate out: (1) decision problems from (2) acts of data description. This is a topic of great controversy in statistics.

Question: In what sense does increasing the sample size always lead to rejection of a hypothesis? If the null is not exactly true, we get rejections. (The power of the test $\to 1$ for a fixed significance level as the sample size increases.)

Example to refresh your memory about Neyman-Pearson theory. Take a one-tail normal test about a mean: what is the test? Assume $\sigma^2$ is known.

$H_0: X \sim N(\mu_0, \sigma^2)$ vs. $H_A: X \sim N(\mu_A, \sigma^2)$, with $\mu_A > \mu_0$.

For any $c$ we get

$\Pr\left(\frac{\bar X - \mu_0}{\sqrt{\sigma^2/T}} > \frac{c - \mu_0}{\sqrt{\sigma^2/T}}\right) = \alpha(c)$

(exploit the symmetry of the standard normal around the origin). For a fixed $\alpha$, we can solve for $c(\alpha)$:

$c(\alpha) = \mu_0 - \sqrt{\sigma^2/T}\,\Phi^{-1}(\alpha) = \mu_0 + \sqrt{\sigma^2/T}\,\Phi^{-1}(1-\alpha).$

Now what is the probability of rejecting the hypothesis under alternatives (the power of the test)? Let $\mu_A$ be the alternative value of $\mu$. Fix $c$ to have a certain size. (Use the previous calculations.)

$\Pr\left(\frac{\bar X - \mu_A}{\sqrt{\sigma^2/T}} > \frac{c(\alpha) - \mu_A}{\sqrt{\sigma^2/T}}\right) = \Pr\left(Z > \frac{\mu_0 - \mu_A}{\sqrt{\sigma^2/T}} - \Phi^{-1}(\alpha)\right).$

We are evaluating the probability of rejection when we allow $\mu_A$ to vary.

Thus

$\text{power} = \Pr\left(Z > \frac{\mu_0 - \mu_A}{\sqrt{\sigma^2/T}} - \Phi^{-1}(\alpha)\right) = \alpha \quad \text{when } \mu_A = \mu_0.$

If $\mu_A > \mu_0$, this probability goes to one as $T \to \infty$. This is a consistent test.
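A short numerical sketch of this power calculation, under the same assumptions (one-sided test, $\sigma^2$ known); the function name and the default values are illustrative.

```python
# Sketch: power of the one-sided test of H0: mu = mu0 against mu = muA.
import numpy as np
from scipy.stats import norm

def power(muA, mu0=0.0, sigma=1.0, T=100, alpha=0.05):
    # critical value c(alpha) = mu0 + (sigma/sqrt(T)) * Phi^{-1}(1 - alpha)
    c = mu0 + sigma / np.sqrt(T) * norm.ppf(1 - alpha)
    # power = P(Xbar > c | mu = muA) when Xbar ~ N(muA, sigma^2/T)
    return norm.sf((c - muA) / (sigma / np.sqrt(T)))

print(power(muA=0.0))          # equals the size alpha = 0.05 at the null
print(power(muA=0.3))          # power rises with the distance muA - mu0
print(power(muA=0.3, T=1000))  # and with the sample size T (consistency)
```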

Now suppose we seek to test $H_0: \mu = \mu_0$. Use a fixed $c$. If $H_0$ is true,

$\frac{\bar X - \mu_0}{\sqrt{\sigma^2/T}} \sim N(0, 1),$

and the distribution of $\bar X$ becomes more and more concentrated at $\mu_0$ as $T$ grows. We reject the null (for $T$ large) unless the true $\mu$ exactly equals $\mu_0$.

[Figure: Power of the test, $\Pr[\bar X > c(\alpha)]$, plotted against the true value $\mu_A$ for sample sizes $T = 1, 2, 10, 100, 1000$, with $\alpha = 0.05$, $\sigma = 1$, $\mu_0 = 0$.]

Parenthetical note: observe that if we measure $X$ with the slightest error and the errors do not have mean zero, we always reject $H_0$ for $T$ big enough.

Design of Sample Size

Suppose that we fix the power at $\pi^*$. Pick $c(\alpha)$. What sample size produces the desired power? We postulate the alternative $\mu_A = \mu_0 + \Delta$.

$\Pr\left(Z > \frac{\mu_0 - \mu_A}{\sqrt{\sigma^2/T}} - \Phi^{-1}(\alpha)\right) = \Pr\left(Z > -\frac{\Delta}{\sqrt{\sigma^2/T}} - \Phi^{-1}(\alpha)\right) = \pi^*$

$\Rightarrow\; -\frac{\Delta}{\sqrt{\sigma^2/T}} - \Phi^{-1}(\alpha) = \Phi^{-1}(1 - \pi^*) = -\Phi^{-1}(\pi^*)$

$\Rightarrow\; T = \frac{\sigma^2}{\Delta^2}\left[\Phi^{-1}(\pi^*) - \Phi^{-1}(\alpha)\right]^2.$

This is the minimum $T$ needed to reject the null at the specified alternative: it has power $\pi^*$ for effect size $\Delta$. Pick the sample size on this basis (this is used in sample design). What value of $\Delta$ should one use?
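A minimal sketch of the sample-size formula $T = (\sigma^2/\Delta^2)\left[\Phi^{-1}(\pi^*) - \Phi^{-1}(\alpha)\right]^2$ just derived; the defaults ($\alpha = .05$, power $.80$) are conventional choices, not values from the notes.

```python
# Sketch: minimum sample size for power pi_star against the alternative mu0 + Delta.
import math
from scipy.stats import norm

def min_sample_size(Delta, sigma=1.0, alpha=0.05, pi_star=0.80):
    # T = (sigma^2 / Delta^2) * (Phi^{-1}(pi_star) - Phi^{-1}(alpha))^2
    z = norm.ppf(pi_star) - norm.ppf(alpha)
    return math.ceil((sigma / Delta) ** 2 * z ** 2)

print(min_sample_size(Delta=0.5))   # larger effect sizes need fewer observations
print(min_sample_size(Delta=0.2))   # smaller effect sizes need many more
```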

Observe that two investigators with the same $\alpha$ but different sample sizes have different power. This is often ignored in empirical work. Why not equalize the power of the tests across samples? Why use the same size of test in all empirical work?

3 Alternative Approaches to Testing and Inference

3.1 Classical Hypothesis Testing

(1) Appeals to long-run frequencies. (2) Designs an ex ante rule that on average works well, e.g., 5% of the time in repeated trials we make the error of rejecting the null for a 5% significance level. (3) Entails a hypothetical set of trials, and is based on a long-run justification.

(4) Consistency of an estimator is an example of this mindset. E.g., $Y = X\beta + U$ with $\mathrm{Cov}(X, U) \neq 0$: OLS is biased for $\beta$. Suppose we have an instrument $Z$ with $\mathrm{Cov}(Z, U) = 0$ and $\mathrm{Cov}(Z, X) \neq 0$:

$\mathrm{plim}\ \hat\beta_{OLS} = \beta + \frac{\mathrm{Cov}(X, U)}{\mathrm{Var}(X)}$

$\mathrm{plim}\ \hat\beta_{IV} = \beta + \underbrace{\frac{\mathrm{Cov}(Z, U)}{\mathrm{Cov}(Z, X)}}_{=0} = \beta.$
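A minimal simulation sketch of this consistency argument; the data-generating process below (one endogenous regressor, one instrument, and the particular error structure) is hypothetical and chosen only to make the OLS bias visible.

```python
# Sketch: OLS is inconsistent when Cov(X, U) != 0, while IV with a valid
# instrument Z (Cov(Z, U) = 0, Cov(Z, X) != 0) is consistent.
import numpy as np

rng = np.random.default_rng(1)
T, beta = 200_000, 1.0
z = rng.normal(size=T)                 # instrument
v = rng.normal(size=T)
u = 0.8 * v + rng.normal(size=T)       # error correlated with X through v
x = z + v                              # endogenous regressor
y = beta * x + u

beta_ols = (x @ y) / (x @ x)           # plim = beta + Cov(X,U)/Var(X) = 1 + 0.4
beta_iv = (z @ y) / (z @ x)            # plim = beta + Cov(Z,U)/Cov(Z,X) = beta
print(beta_ols, beta_iv)               # roughly 1.4 vs. roughly 1.0
```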

Another consistent estimator: (1) use OLS for the first $10^{100}$ observations, (2) then use IV. This is likely to have poor small-sample properties, but on a long-run frequency justification it is just fine.

3.2 Examples of why some people get very unhappy about classical procedures

Example 1. Sample size $T = 2$: observations $(X_1, X_2)$, i.i.d., with

$\Pr(X_i = \theta_0 - 1) = \Pr(X_i = \theta_0 + 1) = \tfrac12, \quad i = 1, 2.$

One possible (smallest) confidence set for $\theta_0$ is

$C(X_1, X_2) = \begin{cases} \tfrac12(X_1 + X_2) & \text{if } X_1 \neq X_2 \\ X_1 - 1 & \text{if } X_1 = X_2. \end{cases}$

Thus 75% of the time $C(X_1, X_2)$ contains $\theta_0$ (in 75% of repeated trials it covers $\theta_0$ — verify this). Yet if $X_1 \neq X_2$, we are certain that the confidence set exactly covers the true value: 100% of the time it is right. Ex post, or conditional on the data, we get the exact value.
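A simulation sketch verifying these coverage statements (the value of $\theta_0$ below is arbitrary, chosen so that the floating-point comparisons are exact):

```python
# Sketch: coverage of the confidence set in Example 1.
# X1, X2 i.i.d. with P(X = theta0 - 1) = P(X = theta0 + 1) = 1/2.
# C = {(X1 + X2)/2} if X1 != X2, and {X1 - 1} if X1 = X2.
import numpy as np

rng = np.random.default_rng(2)
theta0, reps = 2.0, 200_000
x = theta0 + rng.choice([-1.0, 1.0], size=(reps, 2))
differ = x[:, 0] != x[:, 1]
estimate = np.where(differ, x.mean(axis=1), x[:, 0] - 1.0)
covered = estimate == theta0

print(covered.mean())            # unconditional coverage: about 0.75
print(covered[differ].mean())    # conditional on X1 != X2: exactly 1.0
print(covered[~differ].mean())   # conditional on X1 == X2: about 0.5
```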

Example 2. (D.R. Cox) (1) You have data, say on DNA from crime scenes. (2) You can send the data to the New York or the California lab. Both labs seem equally good. (3) Toss a coin to decide which lab analyzes the data. Should the coin flip be accounted for in the design of the test statistic?

Example 3. Test $H_0: \theta = -1$ vs. $H_A: \theta = 1$, with $X \sim N(\theta, 0.25)$. Consider the rejection region: reject if $X \geq 0$. If we observe $X = 0$, we would have $\alpha = .0228$. The size is $.0228$; the power under the alternative is $.9772$. Looks good, but if we reverse the roles of the null and the alternative, it would also look good.

[Figure: the densities $\phi((x-\mu_0)/\sigma)$ and $\phi((x-\mu_A)/\sigma)$ for $\mu_0 = -1$, $\mu_A = 1$, $\sigma^2 = 0.25$, with the rejection region $X \geq 0$.] In the case of $X = 0$ being observed: size $= .0228$, power $= .9772$.
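A quick numerical check of Example 3, assuming its setup ($X \sim N(\theta, 0.25)$, reject $H_0$ if $X \geq 0$): the size, the power, and the likelihood ratio at the boundary observation $X = 0$.

```python
# Sketch: size and power of the rule "reject H0: theta = -1 if X >= 0"
# when X ~ N(theta, 0.25), plus the likelihood ratio at the observed X = 0.
from scipy.stats import norm

sigma = 0.5                                   # standard deviation, since the variance is 0.25
size = norm.sf(0.0, loc=-1.0, scale=sigma)    # P(X >= 0 | theta = -1) = 0.0228
power = norm.sf(0.0, loc=1.0, scale=sigma)    # P(X >= 0 | theta = +1) = 0.9772
lr = norm.pdf(0.0, loc=-1.0, scale=sigma) / norm.pdf(0.0, loc=1.0, scale=sigma)
print(size, power, lr)                        # lr = 1.0: X = 0 favors neither hypothesis
```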

Example 4. $x \in \{1, 2, 3\}$; we have two possible hypotheses, $H_0$ and $H_1$, with

          x = 1    x = 2    x = 3
  H_0     .009     .001     .99
  H_1     .001     .989     .01

Consider the test: accept $H_0$ when $x = 3$ and accept $H_1$ otherwise ($\alpha = .01$ and power $= .99$: high power).

If we observe $x = 1$ we reject $H_0$. But the likelihood ratio in favor of $H_0$ is $.009/.001 = 9$.

Example 5.

          x = 1    x = 2    x = 3
  H_0     .005     .005     .99
  H_1     .0051    .9849    .01

Reject $H_0$ when $x \in \{1, 2\}$: power $= .99$, size $= .01$. Is it reasonable to pick $H_1$ over $H_0$ when $x = 1$ is observed? (The likelihood ratio, $.0051/.005$, does not strongly support the alternative.)

Example 6. (Lindley and Phillips, American Statistician, August 1976.) Consider an experiment: we draw 12 balls from an urn. The urn has an infinite number of balls. Let $\theta$ = probability of black and $(1 - \theta)$ = probability of red.

Trials are independent, and under the hypothesis to be tested red and black are equally likely on each trial. With $N_1$ = number of black balls in 12 draws,

$\Pr(N_1 \text{ are black}) = \binom{12}{N_1}\theta^{N_1}(1-\theta)^{12-N_1}.$

Suppose that we draw 9 black balls and 3 red balls. What is the evidence in support of the hypothesis that $\theta = \tfrac12$?

We might choose the critical region $C_1 = \{9, 10, 11, 12\}$ to reject the null of $\theta = \tfrac12$:

$p = \Pr\!\left(N_1 \in \{9, 10, 11, 12\} \mid \theta = \tfrac12\right) = \left[\binom{12}{3} + \binom{12}{2} + \binom{12}{1} + \binom{12}{0}\right]\left(\tfrac12\right)^{12} = \frac{299}{4096} \approx 7.3\%.$

We do not reject for $\alpha = .05$. This sampling distribution assumes that 10 blacks and 2 reds is a possibility. (It is based on a counterfactual space of what else could occur and with what probability.)

Now consider an alternative sampling rule: draw balls until 3 red balls are observed and then stop. (So 10 blacks and 2 reds on a trial of 12 is not possible, as it was before.) The distribution of $N_2$ (the number of black balls in this experiment) is

$\Pr(N_2 = n) = \binom{n+2}{2}\theta^{n}(1-\theta)^{3}.$

Take the rejection region $C_2 = \{9, 10, 11, 12, 13, \dots\}$, i.e., reject if $N_2 \geq 9$:

$\Pr(N_2 \geq 9 \mid \theta = \tfrac12) \approx 3.3\%.$

Now the result is significant. Reject the null of equality.

Now suppose that in both cases you observe 9 black and 3 red on a single trial.

In computing $p$-values and significance levels, you need to model what didn't occur. The answer depends on the stopping rule and the hypothetical admissible sample space.
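A sketch of the two tail probabilities in Example 6, computed under the two stopping rules at $\theta = \tfrac12$ with 9 blacks and 3 reds observed; scipy's binomial and negative binomial distributions serve as the two sampling models.

```python
# Sketch: the same data (9 black, 3 red) give different p-values under
# different stopping rules, although the likelihood for theta is proportional
# to theta^9 (1 - theta)^3 in both cases.
from scipy.stats import binom, nbinom

theta = 0.5                               # hypothesized probability of black
# Rule 1: fix 12 draws and count blacks; p = P(N1 >= 9 | theta).
p_binomial = binom.sf(8, 12, theta)       # about 0.073: not significant at the 5% level
# Rule 2: draw until the 3rd red; N2 = blacks drawn before stopping.
# scipy's nbinom counts failures (blacks) before the 3rd success (red).
p_negbinom = nbinom.sf(8, 3, 1 - theta)   # about 0.033: significant at the 5% level
print(p_binomial, p_negbinom)
```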

3.3 Additional Problems with Classical Statistics

Classical: design of what could happen. This affects $p$-values. Trials with 3 results $E_1$, $E_2$, $E_3$ (from Hacking). The null $h$ and the alternatives $i$, $j$ assign the following probabilities:

          E_1        E_2     E_3
  h       .01        .95     .04
  i       0          .95     .05
  j       .00001     .95     .04999

Reject $h$ iff $E_1$ occurs. This is not the most powerful test; in fact, it is a biased test (size $= .01$, which is greater than the power of $0$ against $i$).

Power (from Hacking). Two hypotheses over four results:

          E_1     E_2     E_3     E_4
  Null     0      .01     .01     .98
  Alt.    .01     .01     .97     .01

Test 1: reject iff $E_3$ occurs; size $= .01$, power $= .97$. Test 2: reject iff $E_1$ or $E_2$ occurs; size $= .01$, power $= .02$. Test 2 is inferior to Test 1 in power, but if $E_1$ occurs we reject and then we know the null has to be false (the null gives $E_1$ probability zero). Test 2 is a much better ex post test than Test 1.

3.4 Likelihood Principle

All of the information is in the sample. Look at the likelihood as the best summary of the sample.

Likelihood Approach. Recall from our discussion of asymptotics (Part III) that, under the regularity conditions of Theorems 2 and 3, because $\partial \ln \mathcal{L}(\hat\theta)/\partial\theta = 0$ (the first-order condition defining the MLE $\hat\theta$), a second-order expansion around $\hat\theta$ gives

$\ln \mathcal{L}(\theta_0) = \ln \mathcal{L}(\hat\theta) + \frac12(\theta_0 - \hat\theta)'\,\frac{\partial^2 \ln \mathcal{L}(\hat\theta)}{\partial\theta\,\partial\theta'}\,(\theta_0 - \hat\theta) + o_p(1),$

where in the i.i.d. case $\ln \mathcal{L}(\theta) = \sum_{t=1}^{T} \ln f(y_t \mid \theta)$.

In terms of the information matrix $\mathcal{I}(\theta_0)$, for the likelihood case

$\ln \mathcal{L}(\hat\theta) = \ln \mathcal{L}(\theta_0) + \frac12(\hat\theta - \theta_0)'\,T\,\mathcal{I}(\theta_0)\,(\hat\theta - \theta_0) + o_p(1).$

So we know that, as $T \to \infty$, the likelihood $\mathcal{L}$ looks like a normal, e.g.,

$N(\theta \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}}\exp\left\{-\frac12(\theta - \mu)'\Sigma^{-1}(\theta - \mu)\right\},$

$\ln N(\theta \mid \mu, \Sigma) = -\frac{k}{2}\ln(2\pi) - \frac12\ln|\Sigma| - \frac12(\theta - \mu)'\Sigma^{-1}(\theta - \mu).$

So the likelihood converges to a normal-looking criterion and has its mode near $\theta_0$ in large samples. The most likely value is the MLE (the mode of the likelihood), which converges to $\theta_0$.

3.5 Bayesian Principle

Use prior information in conjunction with sample information. Place priors on parameters. The Classical Method and the Likelihood Principle sharply separate parameters from data (random variables). The Bayesian method does not: all parameters are random variables.

The Bayesian and likelihood approaches both use the likelihood. Likelihood: use data from the experiment (evidence concentrates on $\theta_0$). Bayesian: use data from the experiment plus a prior.

The Bayesian approach postulates a prior $g(\theta)$. This is a probability distribution over $\theta$. Compute the posterior using Bayes' theorem:

$\underbrace{g(\theta \mid y)}_{\text{posterior}} = c \cdot \underbrace{L(y \mid \theta)}_{\text{likelihood}} \cdot \underbrace{g(\theta)}_{\text{prior}},$

where $c$ is a constant defined so that the posterior integrates to 1. We get a posterior that is independent of multiplicative constants (and therefore of the sampling rule).

3.6 De Finetti's Theorem

Let $X$ denote a binary variable in $\{0, 1\}$, part of an exchangeable sequence (i.i.d. given $\theta$), with $\Pr(X = 1) = \theta$ and $\Pr(X = 0) = 1 - \theta$. Let $P(r, s)$ be the probability of $r$ ones and $s$ zeros. Then

$P(r, s) = \int_0^1 \binom{r+s}{r}\theta^{r}(1-\theta)^{s}\,d\mu(\theta)$

for some measure $\mu(\cdot) \geq 0$. (This is just the standard Hausdorff moment problem.)

For this problem a natural conjugate prior is the Beta density

$g(\theta) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)}, \qquad 0 \leq \theta \leq 1.$

For $a = b = 1$ this is the uniform prior; its mean is $E(\theta) = \dfrac{a}{a+b}$.

[Figures: Beta$(\theta; a, b)$ probability density functions plotted over $\theta \in [0, 1]$ for $a \in \{0.1, 0.5, 1, 2, 3, 5, 10\}$ and $b \in \{0.1, 0.5, 1, 2, 3, 5, 10\}$.]

Posterior:

$g(\theta \mid y) = c\,\underbrace{\theta^{r}(1-\theta)^{s}}_{\text{likelihood}}\,\underbrace{\theta^{a-1}(1-\theta)^{b-1}}_{\text{prior}},$

where $y$ is the data ($r$ blacks and $s$ reds) and $c$ is a normalizing constant that makes the density integrate to one:

$c\int_0^1 \theta^{r+a-1}(1-\theta)^{s+b-1}\,d\theta = 1.$

Observe, crucially, that the normalizing constant is the same for both sampling rules we discussed in the red-ball and black-ball problem. Why? Because we choose $c$ to make $g(\theta \mid y)$ integrate to one.

Mean of the posterior with the Beta$(a, b)$ prior:

$E(\theta \mid y) = \frac{r + a}{(r + a) + (s + b)}.$

Notice: the combinatorial constants that played such a crucial role in the sampling distribution play no role here; they vanish into the normalizing constant. The mode of the posterior is

$\frac{r + a - 1}{(r + a - 1) + (s + b - 1)}.$

The likelihood corresponds to $(r + s)$ trials with $r$ black and $s$ red. The prior corresponds to $(a + b - 2)$ trials with $(a - 1)$ black and $(b - 1)$ red.
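A sketch of the conjugate update for the urn example under a Beta$(a, b)$ prior (the uniform prior $a = b = 1$ is used for illustration); note that the posterior, and hence its mean and mode, is the same whichever stopping rule produced the 9 blacks and 3 reds.

```python
# Sketch: Beta(a, b) prior with data of r blacks and s reds.
# The posterior is Beta(r + a, s + b) regardless of the stopping rule,
# because the combinatorial constants cancel in the normalization.
from scipy.stats import beta

a, b = 1.0, 1.0                 # uniform prior
r, s = 9, 3                     # 9 black draws, 3 red draws

posterior = beta(r + a, s + b)
post_mean = (r + a) / ((r + a) + (s + b))              # = 10/14
post_mode = (r + a - 1) / ((r + a - 1) + (s + b - 1))  # = 9/12
print(posterior.mean(), post_mean)   # both about 0.714
print(post_mode)                     # 0.75, the MLE under the uniform prior
print(posterior.cdf(0.5))            # posterior probability that theta <= 1/2
```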

Empirical Bayes Approach: estimate the prior. Go to the Beta-Binomial example:

$\Pr(r, s \mid a, b) = \int_0^1 \binom{r+s}{r}\theta^{r}(1-\theta)^{s}\,\frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)}\,d\theta = \binom{r+s}{r}\frac{B(r+a,\, s+b)}{B(a, b)}.$

Now $\theta$ is a heterogeneity parameter distributed Beta$(a, b)$. Estimate $a$ and $b$ as parameters from a string of trials, each with its own count of blacks and reds; $\theta$ is a person-specific parameter.
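A hedged sketch of the empirical Bayes step: with many units, each contributing $r_i$ blacks out of $n$ draws, the hyperparameters $(a, b)$ can be estimated by maximizing the beta-binomial likelihood. The panel below is simulated, and the optimizer settings are illustrative.

```python
# Sketch: empirical Bayes estimation of the Beta(a, b) hyperparameters by
# maximizing the beta-binomial likelihood over a panel of units, each with
# its own theta_i drawn from Beta(a, b). Data here are simulated.
import numpy as np
from scipy.special import betaln, gammaln
from scipy.optimize import minimize

rng = np.random.default_rng(3)
a_true, b_true, n, units = 2.0, 5.0, 20, 500
theta_i = rng.beta(a_true, b_true, size=units)   # person-specific probabilities
r_i = rng.binomial(n, theta_i)                   # black counts for each unit

def neg_loglik(params):
    a, b = np.exp(params)                        # reparameterize so that a, b > 0
    # log P(r_i | a, b) = log C(n, r_i) + log B(r_i + a, n - r_i + b) - log B(a, b)
    ll = (gammaln(n + 1) - gammaln(r_i + 1) - gammaln(n - r_i + 1)
          + betaln(r_i + a, n - r_i + b) - betaln(a, b))
    return -ll.sum()

res = minimize(neg_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
print(np.exp(res.x))   # estimates of (a, b); close to (2, 5) for large panels
```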

A similar idea applies in the linear regression model $Y = X\beta + U$. (Use the notes from the second day of class for the proof of identification; we can identify means and variances.) Let the coefficient be heterogeneous: $\beta_i = \bar\beta + \varepsilon_i$ with $E(\varepsilon_i) = 0$. Assume $\varepsilon_i \perp\!\!\!\perp (X_i, U_i)$. Then

$Y_i = X_i\bar\beta + (X_i\varepsilon_i + U_i), \qquad \mathrm{Var}(Y_i \mid X_i) = X_i\,\Sigma_\varepsilon\,X_i' + \sigma_U^2.$

Use the squared OLS residuals to identify $\Sigma_\varepsilon$ given $\sigma_U^2$.

Notice: we can extend the model to allow $\bar\beta = Z\gamma + v$ and identify $\gamma$ (a hierarchical model).

Digression: take the classical normal linear regression model

$Y = X\beta + U, \qquad E(UU') = \sigma^2 I, \qquad \hat\beta = (X'X)^{-1}X'Y, \qquad \mathrm{Var}(\hat\beta) = \sigma^2 (X'X)^{-1}.$

Assume $\sigma^2$ is known. Take a prior on $\beta$: $\beta \sim N\!\left(\beta_0,\ \sigma^2 (X_0'X_0)^{-1}\right)$, written as if it came from a prior sample $(Y_0, X_0)$. The posterior is normal:

$\beta \mid Y \sim N\!\left((X_0'X_0 + X'X)^{-1}\left(X_0'X_0\,\beta_0 + X'X\,\hat\beta\right),\ \sigma^2 (X_0'X_0 + X'X)^{-1}\right).$

Thus we can think of the prior as a sample of observations, with the $(X'X)$ matrix for that sample being $X_0'X_0$ and the OLS estimate from the prior sample being $\beta_0$. Compare to stacking $Y_0 = X_0\beta + U_0$ with $Y = X\beta + U$: OLS on the stacked data is

$(X_0'X_0 + X'X)^{-1}(X_0'X_0\,\beta_0 + X'X\,\hat\beta), \qquad \beta_0 = (X_0'X_0)^{-1}X_0'Y_0, \quad \hat\beta = (X'X)^{-1}X'Y.$

(Prove this.) See Leamer for the more general case where $\sigma^2$ is unknown (gamma prior).
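A sketch verifying the "prior as a pseudo-sample" interpretation: OLS on the stacked data $(X_0, Y_0), (X, Y)$ reproduces the matrix-weighted combination of the two subsample OLS estimates. The simulated design below is illustrative.

```python
# Sketch: pooled OLS on stacked samples equals the matrix-weighted average
# (X0'X0 + X'X)^{-1} (X0'X0 b0 + X'X b), which is the posterior-mean formula
# with the prior treated as a pseudo-sample (X0, Y0).
import numpy as np

rng = np.random.default_rng(4)
k, T0, T = 2, 30, 200
beta = np.array([1.0, -0.5])
X0 = rng.normal(size=(T0, k))                # "prior" pseudo-sample design
X = rng.normal(size=(T, k))                  # data design
Y0 = X0 @ beta + rng.normal(size=T0)
Y = X @ beta + rng.normal(size=T)

b0 = np.linalg.solve(X0.T @ X0, X0.T @ Y0)   # OLS on the prior sample
b = np.linalg.solve(X.T @ X, X.T @ Y)        # OLS on the data sample
combined = np.linalg.solve(X0.T @ X0 + X.T @ X,
                           X0.T @ X0 @ b0 + X.T @ X @ b)
Xs, Ys = np.vstack([X0, X]), np.concatenate([Y0, Y])
stacked = np.linalg.solve(Xs.T @ Xs, Xs.T @ Ys)
print(np.allclose(combined, stacked))        # True: pooled OLS = weighted average
```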

To compute evidence on one hypothesis vs. another hypothesis, use the posterior odds ratio:

$\frac{\Pr(H_1 \mid y)}{\Pr(H_0 \mid y)} = \frac{\Pr(y \mid H_1)}{\Pr(y \mid H_0)} \cdot \frac{\Pr(H_1)}{\Pr(H_0)}.$

Hypotheses are restrictions on the prior (e.g., different choices of $g(\theta)$), e.g., a uniform prior on $\theta$ vs. a prior concentrated on a single value.

Classical Approach vs. Bayesian Approach

Assumption regarding experiment:
  Classical: events independent, given a probability.
  Bayesian: events form exchangeable sequences.

Interpretation of probability:
  Classical: relative frequency; applies only to repeated events.
  Bayesian: degrees of belief; applies both to unique events and to sequences of events.

Statistical inferences:
  Classical: based on the sampling distribution; the sample space or stopping rule must be specified.
  Bayesian: based on the posterior distribution; the prior distribution must be assessed.

Estimates of parameters:
  Classical: requires a theory of estimation.
  Bayesian: descriptive statistics of the posterior distribution.

Intuitive judgement:
  Classical: used in setting significance levels, in choice of procedure, and in other ways.
  Bayesian: formally incorporated in the prior distribution.

Source: Lindley, D.V. and Phillips, L.D. (1976). "Inference for a Bernoulli Process (A Bayesian View)." American Statistician 30(3): 112-119.

4 Bayesian Testing

Point null vs. point alternative test (Leamer example). Think of a regression model:

$H_1: Y = X\beta_1 + U_1 \quad \text{vs.} \quad H_0: Y = X\beta_0 + U_0,$

where the hypotheses fix $\beta$ at $\beta_1$ or $\beta_0$. Posterior odds ratio:

$\underbrace{\frac{\Pr(H_1 \mid Y)}{\Pr(H_0 \mid Y)}}_{\text{posterior odds ratio}} = \underbrace{\frac{\Pr(Y \mid H_1)}{\Pr(Y \mid H_0)}}_{\text{Bayes factor}} \times \underbrace{\frac{\Pr(H_1)}{\Pr(H_0)}}_{\text{prior odds ratio}}.$

Predictive density:

$\Pr(Y \mid H_i) = \int\!\!\int \underbrace{f(Y \mid \beta, \sigma^2)}_{\text{likelihood}}\ \underbrace{g(\beta, \sigma^2 \mid H_i)}_{\text{prior density}}\ d\beta\, d\sigma^2.$

The evidence supports the model with the higher posterior probability.

Example: $Y_t \sim N(\mu, \sigma^2)$, $t = 1, \dots, T$, with $\sigma^2$ known.

$H_0: \mu_0 = 0,\ \sigma = 1; \qquad H_1: \mu_1 = 1,\ \sigma = 1.$

Thus $\bar Y \sim N(0, 1/T)$ under $H_0$ and $\bar Y \sim N(1, 1/T)$ under $H_1$.

Typical Neyman-Pearson rule: reject $H_0$ if $\bar Y > c$; accept $H_0$ if $\bar Y \leq c$. Type 1 and Type 2 errors:

$\alpha(c) = \Pr(\bar Y > c \mid \mu = 0), \qquad \beta(c) = \Pr(\bar Y \leq c \mid \mu = 1).$

Example: $c = 0.5$; with $T = 1$, $\alpha = \beta \approx .31$.

Bayes approach:

$\Pr(H_0 \mid \bar Y) = \frac{f_0(\bar Y)\Pr(H_0)}{f_0(\bar Y)\Pr(H_0) + f_1(\bar Y)\Pr(H_1)}, \qquad \Pr(H_1 \mid \bar Y) = \frac{f_1(\bar Y)\Pr(H_1)}{f_0(\bar Y)\Pr(H_0) + f_1(\bar Y)\Pr(H_1)},$

where $f_i$ is the density of $\bar Y$ under $H_i$.

$\frac{\Pr(H_0 \mid \bar Y)}{\Pr(H_1 \mid \bar Y)} = \frac{f_0(\bar Y)\Pr(H_0)}{f_1(\bar Y)\Pr(H_1)} = \exp\left\{-\frac{T}{2}\left[\bar Y^2 - (\bar Y - 1)^2\right]\right\}\frac{\Pr(H_0)}{\Pr(H_1)} = \exp\left\{-\frac{T}{2}\left[2\bar Y - 1\right]\right\}\frac{\Pr(H_0)}{\Pr(H_1)} = \exp\left\{T\left(\tfrac12 - \bar Y\right)\right\}\frac{\Pr(H_0)}{\Pr(H_1)}.$

(Recall $\sigma^2 = 1$ under both the null and the alternative.)

Taking logs, accept $H_0$ if

$\ln\frac{\Pr(H_0 \mid \bar Y)}{\Pr(H_1 \mid \bar Y)} = T\left(\tfrac12 - \bar Y\right) + \ln\frac{\Pr(H_0)}{\Pr(H_1)} \geq 0 \quad\Longleftrightarrow\quad \bar Y \leq \frac12 + \frac{1}{T}\ln\frac{\Pr(H_0)}{\Pr(H_1)}.$

(If this holds, accept $H_0$.) As $T$ gets big, the cutoff changes with the sample size unless $\Pr(H_0) = \Pr(H_1) = \tfrac12$. Notice that this is different from the classical statistical rule of a fixed cutoff point.
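A sketch contrasting the two cutoffs for $\bar Y$ as $T$ grows, assuming the setup above ($H_0: \mu = 0$ vs. $H_1: \mu = 1$, $\sigma = 1$): the Bayes cutoff $\tfrac12 + \tfrac1T\ln[\Pr(H_0)/\Pr(H_1)]$ drifts toward $\tfrac12$, while the classical cutoff for a fixed $\alpha$ shrinks toward the null. The prior probabilities and $\alpha$ below are illustrative.

```python
# Sketch: Bayes cutoff for accepting H0: mu = 0 against H1: mu = 1
# (Ybar <= 1/2 + (1/T) ln(P(H0)/P(H1))), compared with the classical
# cutoff c(alpha) = Phi^{-1}(1 - alpha)/sqrt(T) for fixed alpha.
import numpy as np
from scipy.stats import norm

p0, p1, alpha = 0.8, 0.2, 0.05                        # illustrative priors and test size
for T in (1, 10, 100, 1000):
    bayes_cut = 0.5 + np.log(p0 / p1) / T             # tends to 1/2 as T grows
    classical_cut = norm.ppf(1 - alpha) / np.sqrt(T)  # tends to 0 as T grows
    print(T, round(bayes_cut, 4), round(classical_cut, 4))
```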

Compare this to the fixed rule in Example 3. This is a Bayesian version of Example 3 with $H_0: \mu = 0$ vs. $H_A: \mu = 1$, instead of $H_0: \mu = -1$ vs. $H_A: \mu = 1$. Note that we get the same kind of rule in both cases when $\Pr(H_0) = \Pr(H_1) = \tfrac12$.

Point Null vs. Composite Alternative

Same setup as in the previous case: $Y_t \sim N(\mu, \sigma^2)$. $H_0: \mu = 0$ vs. $H_A: \mu \neq 0$. $\sigma^2$ is unspecified but common across models. Turn the Bayes crank. Likelihood factor: $f(Y; \mu, \sigma^2)$ vs. $f(Y; 0, \sigma^2)$. Relative likelihood:

$L(\mu) = \exp\left\{\frac{T}{2\sigma^2}\left[\bar Y^2 - (\bar Y - \mu)^2\right]\right\}.$

What value of $\mu$ is best supported by the data? Recall the likelihood approach (it focuses on the parameter value that makes the observed outcome most likely): the maximum is at $\mu = \bar Y$, where

$L(\bar Y) = \exp\left\{\frac{T\bar Y^2}{2\sigma^2}\right\}.$

[Figure: the relative likelihood $\exp\{[\bar Y^2 - (\bar Y - \mu)^2]\,T/(2\sigma^2)\}$ plotted against $\mu$, for $\bar Y = 1$, $\sigma = 1$, $T = 1$.]

The $p$-value approach uses absolute likelihood, not relative likelihood. In what sense is the observed outcome "most likely"? Likelihood approach: evaluate the relative likelihood at the null $\mu = 0$ and we get

$L(0)/L(\bar Y) = \exp\left\{-\frac{T\bar Y^2}{2\sigma^2}\right\} = \exp\Big\{-\tfrac12 \underbrace{\big(\bar Y / \sqrt{\sigma^2/T}\big)^2}_{t^2 \text{ for } \mu = 0}\Big\}.$

This is an expression of support for the hypothesis $\mu = 0$: a big $t$ value leads to rejection of the null. But this approach does not worry about the alternative.

Frequency theory or sampling approach: look at the sampling distribution of the model's test statistic $t$, centered at $\mu = 0$,

$\alpha(c) = \Pr(|t| > c \mid \mu = 0),$

e.g., if $|t| > 1.96$ we reject. The $p$-value is the knife-edge $\alpha$: the significance level at which the observed value just leads to rejection. At any level less than the $p$-value, the null hypothesis is not rejected.


Significance level: is what occurred unlikely? Relative likelihood is a relative idea (null vs. alternative): support for one hypothesis vs. support for another.

Bayes approach: allocate positive probability to the null (otherwise the probability of a point null is 0). Prior:

$g(\mu) = \begin{cases} \rho & \text{if } \mu = 0 \ \text{(point mass)} \\ (1 - \rho)\,g_1(\mu) & \text{if } \mu \neq 0, \end{cases}$

where $g_1$ is a $N(0, \sigma_\mu^2)$ density.

(Point mass at the null.) The posterior odds are

$\frac{\Pr(H_1 \mid \bar Y)}{\Pr(H_0 \mid \bar Y)} = \frac{(1 - \rho)\displaystyle\int_{\mu \neq 0} \frac{1}{\sqrt{2\pi\sigma^2/T}}\exp\left\{-\frac{T(\bar Y - \mu)^2}{2\sigma^2}\right\}\frac{1}{\sqrt{2\pi\sigma_\mu^2}}\exp\left\{-\frac{\mu^2}{2\sigma_\mu^2}\right\}d\mu}{\rho\,\dfrac{1}{\sqrt{2\pi\sigma^2/T}}\exp\left\{-\dfrac{T\bar Y^2}{2\sigma^2}\right\}}.$

Complete the square in the numerator and integrate out $\mu$.

Side manipulations: look at the exponent in the numerator,

$-\frac{T(\bar Y^2 - 2\bar Y\mu + \mu^2)}{2\sigma^2} - \frac{\mu^2}{2\sigma_\mu^2}.$

Complete the square to reach

$-\frac{T(\bar Y - \mu)^2}{2\sigma^2} - \frac{\mu^2}{2\sigma_\mu^2} = -\frac12\left(\frac{T}{\sigma^2} + \frac{1}{\sigma_\mu^2}\right)\left(\mu - \frac{T\bar Y/\sigma^2}{T/\sigma^2 + 1/\sigma_\mu^2}\right)^2 - \frac{T\bar Y^2}{2\sigma^2}\left(1 - \frac{T/\sigma^2}{T/\sigma^2 + 1/\sigma_\mu^2}\right).$

Then integrate out $\mu$ and we get (cancelling terms)

$\frac{\Pr(H_1 \mid \bar Y)}{\Pr(H_0 \mid \bar Y)} = \frac{1 - \rho}{\rho}\ \underbrace{\left(\frac{1}{1 + T\sigma_\mu^2/\sigma^2}\right)^{1/2}\exp\left[\frac{t^2}{2}\left(1 - \frac{1}{1 + T\sigma_\mu^2/\sigma^2}\right)\right]}_{\text{Bayes factor}},$

where $t = \bar Y/\sqrt{\sigma^2/T}$. Notice that the higher $t$ is, the more likely we are to reject $H_0$. However, as $T \to \infty$ for fixed $t$, we get $\dfrac{\Pr(H_1 \mid \bar Y)}{\Pr(H_0 \mid \bar Y)} \to 0$. Notice that $t$ is $O_p(1)$ for $\mu = 0$.

We support $H_0$ (the "Lindley Paradox"). Bayesians use the sample size to adjust the critical or rejection region. In the classical case, with $\alpha$ fixed, the power of the test goes to 1. (It overweights the null hypothesis.) Issue: which weighting of $\alpha$ and $\beta$ is better?
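A closing sketch of the formula above: hold the $t$-statistic fixed (say $t = 1.96$, "just significant at 5%") and let $T$ grow; the posterior odds in favor of $H_1$ fall toward zero. The choices $\rho = \tfrac12$ and $\sigma_\mu = \sigma = 1$ are illustrative.

```python
# Sketch: Lindley paradox. Posterior odds Pr(H1|data)/Pr(H0|data) for a fixed
# t-statistic as T grows, using
#   odds = ((1 - rho)/rho) * (1 + T*s2)^(-1/2) * exp((t^2/2) * (1 - 1/(1 + T*s2)))
# with s2 = sigma_mu^2 / sigma^2.
import numpy as np

rho, s2, t = 0.5, 1.0, 1.96    # equal prior odds; fixed "just significant" t-value
for T in (10, 100, 1_000, 100_000):
    bf = (1 + T * s2) ** -0.5 * np.exp(0.5 * t**2 * (1 - 1 / (1 + T * s2)))
    odds = (1 - rho) / rho * bf
    print(T, odds)             # the posterior odds for H1 fall toward 0 as T grows
```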