Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006


1 A Brief Review of Hypothesis Testing and Its Uses

$p$-values and pure significance tests (R.A. Fisher) focus on null hypothesis testing. Neyman-Pearson tests focus on null and alternative hypothesis testing. Both involve an appeal to long-run trials. They adopt an ex ante position (justify a procedure by the number of times it is successful if used repeatedly).

2 Pure Significance Tests

Focuses exclusively on the null hypothesis. Let $(y_1, \dots, y_T)$ be observations from a sample. Let $t(y_1, \dots, y_T)$ be a test statistic. If (1) we know the distribution of $t(y)$ under $H_0$, and (2) the larger the value of $t(y)$, the more the evidence against $H_0$, then $p_{obs} = \Pr(t \geq t_{obs} \mid H_0)$. This is evidence against the null hypothesis.

Observe that the $p$-value is a Uniform$(0,1)$ random variable: for a random variable $X$ with a density (absolutely continuous with respect to Lebesgue measure), $F(X)$ is uniform for any distribution function $F$ that is continuous. (Prove this for yourself; it is automatic from the definition.) The $p$-value is the probability that a statistic at least as large as the one observed would occur given that $H_0$ is a true state of affairs. A $t$ test or $F$ test for a regression coefficient is an example; the higher the test statistic, the more likely we are to reject. This ignores any evidence on alternatives.
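To illustrate the uniformity claim, here is a minimal simulation sketch (the one-sided $z$ test for a normal mean with known variance is an assumed example, not taken from the notes): generating data repeatedly under $H_0$, the empirical distribution of the resulting $p$-values should be close to Uniform$(0,1)$.

```python
# Sketch: under H0 the p-value of a continuous test statistic is Uniform(0,1).
# Hypothetical setup: one-sided z test for the mean of N(mu, 1) data, H0: mu = 0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
T, reps = 25, 100_000
x = rng.normal(loc=0.0, scale=1.0, size=(reps, T))   # data generated under H0
z = x.mean(axis=1) * np.sqrt(T)                      # z-statistic; N(0,1) under H0
pvals = norm.sf(z)                                   # one-sided p-value P(Z >= z_obs)

for q in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(q, np.mean(pvals <= q))                    # each frequency should be close to q
```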

R.A. Fisher liked this feature because it did not involve speculation about possibilities other than the one realized. $p$-values make an absolute statement about a model.

Questions I consider: (1) How to conduct a best test? Compare alternative tests. Any monotonic transformation of the statistic produces the same $p$-value. (2) Pure significance tests depend on the sampling rule used. This is not necessarily bad. (3) How to pool across studies (or across coefficients)?

2.1 Bayesian vs. Frequentist or Classical Approach

ISSUES: (1) In what sense, and how well, do significance levels or $p$-values summarize evidence in favor of or against hypotheses? (2) Do we always reject a null in a big enough sample? Meaningful hypothesis testing (Bayesian or Classical) requires that significance levels decrease with sample size. (3) Two views: $\beta = 0$ tests something meaningful vs. $\beta = 0$ is only an approximation and shouldn't be taken too seriously.

(4) How to test non-nested models (the Cox non-nested test and extensions). Classical form: set

$H_0: Y = X_0\beta_0 + U_0$ vs. $H_1: Y = X_1\beta_1 + U_1$.

Test the artificial compound model $H$: $Y = X_0\beta_0 + X_1\beta_1 + U$. Test $H$ vs. $H_0$ (this is asymmetric), and $H$ is not the hypothesis of interest. Theil $\bar R^2$ test (it is asymmetric in stating the null). See Leamer, Chapter 3.

(5) How to quantify evidence about a model? (How to incorporate prior restrictions?) What is the strength of evidence? (6) How to account for model uncertainty: fishing, etc. First consider the basic Neyman-Pearson structure; then switch over to a Bayesian paradigm.

Useful to separate out: (1) decision problems from (2) acts of data description. This is a topic of great controversy in statistics.

Question: In what sense does increasing the sample size always lead to rejection of a hypothesis? If the null is not exactly true, we get rejections. (The power of the test $\to 1$ for a fixed significance level as the sample size increases.)

Example to refresh your memory about Neyman-Pearson theory. Take a one-tail normal test about a mean: what is the test? Assume $\sigma^2$ is known.

$H_0: X \sim N(\mu_0, \sigma^2)$ vs. $H_A: X \sim N(\mu_A, \sigma^2)$, with $\mu_A > \mu_0$.

For any $c$ we get

$\Pr\left(\frac{\bar X - \mu_0}{\sqrt{\sigma^2/T}} > \frac{c - \mu_0}{\sqrt{\sigma^2/T}}\right) = \alpha(c)$

(exploit the symmetry of the standard normal around the origin). For a fixed $\alpha$, we can solve for $c(\alpha)$:

$c(\alpha) = \mu_0 - \sqrt{\sigma^2/T}\,\Phi^{-1}(\alpha) = \mu_0 + \sqrt{\sigma^2/T}\,\Phi^{-1}(1-\alpha).$

Now what is the probability of rejecting the hypothesis under alternatives (the power of the test)? Let $\mu_A$ be the alternative value of $\mu$. Fix $c$ to have a certain size. (Use the previous calculations.)

$\Pr\left(\frac{\bar X - \mu_A}{\sqrt{\sigma^2/T}} > \frac{c(\alpha) - \mu_A}{\sqrt{\sigma^2/T}}\right) = \Pr\left(Z > \frac{\mu_0 - \mu_A}{\sqrt{\sigma^2/T}} - \Phi^{-1}(\alpha)\right).$

We are evaluating the probability of rejection when we allow $\mu_A$ to vary.

Thus

$\text{power} = \Pr\left(Z > \frac{\mu_0 - \mu_A}{\sqrt{\sigma^2/T}} - \Phi^{-1}(\alpha)\right) = \alpha \quad \text{when } \mu_A = \mu_0.$

If $\mu_A > \mu_0$, this probability goes to one as $T \to \infty$. This is a consistent test.
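A short numerical sketch of this power calculation, under the same assumptions (one-sided test, $\sigma^2$ known); the function name and the default values are illustrative.

```python
# Sketch: power of the one-sided test of H0: mu = mu0 against mu = muA.
import numpy as np
from scipy.stats import norm

def power(muA, mu0=0.0, sigma=1.0, T=100, alpha=0.05):
    # critical value c(alpha) = mu0 + (sigma/sqrt(T)) * Phi^{-1}(1 - alpha)
    c = mu0 + sigma / np.sqrt(T) * norm.ppf(1 - alpha)
    # power = P(Xbar > c | mu = muA) when Xbar ~ N(muA, sigma^2/T)
    return norm.sf((c - muA) / (sigma / np.sqrt(T)))

print(power(muA=0.0))          # equals the size alpha = 0.05 at the null
print(power(muA=0.3))          # power rises with the distance muA - mu0
print(power(muA=0.3, T=1000))  # and with the sample size T (consistency)
```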

Now suppose we seek to test $H_0: \mu = \mu_0$. Use a fixed $c$. If $H_0$ is true,

$\frac{\bar X - \mu_0}{\sqrt{\sigma^2/T}} \sim N(0, 1),$

and the distribution of $\bar X$ becomes more and more concentrated at $\mu_0$ as $T$ grows. We reject the null (for $T$ large) unless the true $\mu$ exactly equals $\mu_0$.

[Figure: Power of the test, $\Pr[\bar X > c(\alpha)]$, plotted against the true value $\mu_A$ for sample sizes $T = 1, 2, 10, 100, 1000$, with $\alpha = 0.05$, $\sigma = 1$, $\mu_0 = 0$.]

Parenthetical note: observe that if we measure $X$ with the slightest error and the errors do not have mean zero, we always reject $H_0$ for $T$ big enough.

Design of Sample Size

Suppose that we fix the power at $\pi^*$. Pick $c(\alpha)$. What sample size produces the desired power? We postulate the alternative $\mu_A = \mu_0 + \Delta$.

$\Pr\left(Z > \frac{\mu_0 - \mu_A}{\sqrt{\sigma^2/T}} - \Phi^{-1}(\alpha)\right) = \Pr\left(Z > -\frac{\Delta}{\sqrt{\sigma^2/T}} - \Phi^{-1}(\alpha)\right) = \pi^*$

$\Rightarrow\; -\frac{\Delta}{\sqrt{\sigma^2/T}} - \Phi^{-1}(\alpha) = \Phi^{-1}(1 - \pi^*) = -\Phi^{-1}(\pi^*)$

$\Rightarrow\; T = \frac{\sigma^2}{\Delta^2}\left[\Phi^{-1}(\pi^*) - \Phi^{-1}(\alpha)\right]^2.$

This is the minimum $T$ needed to reject the null at the specified alternative: it has power $\pi^*$ for effect size $\Delta$. Pick the sample size on this basis (this is used in sample design). What value of $\Delta$ should one use?
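A minimal sketch of the sample-size formula $T = (\sigma^2/\Delta^2)\left[\Phi^{-1}(\pi^*) - \Phi^{-1}(\alpha)\right]^2$ just derived; the defaults ($\alpha = .05$, power $.80$) are conventional choices, not values from the notes.

```python
# Sketch: minimum sample size for power pi_star against the alternative mu0 + Delta.
import math
from scipy.stats import norm

def min_sample_size(Delta, sigma=1.0, alpha=0.05, pi_star=0.80):
    # T = (sigma^2 / Delta^2) * (Phi^{-1}(pi_star) - Phi^{-1}(alpha))^2
    z = norm.ppf(pi_star) - norm.ppf(alpha)
    return math.ceil((sigma / Delta) ** 2 * z ** 2)

print(min_sample_size(Delta=0.5))   # larger effect sizes need fewer observations
print(min_sample_size(Delta=0.2))   # smaller effect sizes need many more
```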

Observe that two investigators with the same $\alpha$ but different sample sizes have different power. This is often ignored in empirical work. Why not equalize the power of the tests across samples? Why use the same size of test in all empirical work?

3 Alternative Approaches to Testing and Inference

3.1 Classical Hypothesis Testing

(1) Appeals to long-run frequencies. (2) Designs an ex ante rule that on average works well, e.g., 5% of the time in repeated trials we make the error of rejecting the null for a 5% significance level. (3) Entails a hypothetical set of trials, and is based on a long-run justification.

(4) Consistency of an estimator is an example of this mindset. E.g., $Y = X\beta + U$ with $\mathrm{Cov}(X, U) \neq 0$: OLS is biased for $\beta$. Suppose we have an instrument $Z$ with $\mathrm{Cov}(Z, U) = 0$ and $\mathrm{Cov}(Z, X) \neq 0$:

$\mathrm{plim}\ \hat\beta_{OLS} = \beta + \frac{\mathrm{Cov}(X, U)}{\mathrm{Var}(X)}$

$\mathrm{plim}\ \hat\beta_{IV} = \beta + \underbrace{\frac{\mathrm{Cov}(Z, U)}{\mathrm{Cov}(Z, X)}}_{=0} = \beta.$
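A minimal simulation sketch of this consistency argument; the data-generating process below (one endogenous regressor, one instrument, and the particular error structure) is hypothetical and chosen only to make the OLS bias visible.

```python
# Sketch: OLS is inconsistent when Cov(X, U) != 0, while IV with a valid
# instrument Z (Cov(Z, U) = 0, Cov(Z, X) != 0) is consistent.
import numpy as np

rng = np.random.default_rng(1)
T, beta = 200_000, 1.0
z = rng.normal(size=T)                 # instrument
v = rng.normal(size=T)
u = 0.8 * v + rng.normal(size=T)       # error correlated with X through v
x = z + v                              # endogenous regressor
y = beta * x + u

beta_ols = (x @ y) / (x @ x)           # plim = beta + Cov(X,U)/Var(X) = 1 + 0.4
beta_iv = (z @ y) / (z @ x)            # plim = beta + Cov(Z,U)/Cov(Z,X) = beta
print(beta_ols, beta_iv)               # roughly 1.4 vs. roughly 1.0
```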

Another consistent estimator: (1) use OLS for the first $10^{100}$ observations, (2) then use IV. This is likely to have poor small-sample properties, but on a long-run frequency justification it is just fine.

3.2 Examples of why some people get very unhappy about classical procedures

Example 1. Sample size $T = 2$: observations $(X_1, X_2)$, i.i.d., with

$\Pr(X_i = \theta_0 - 1) = \Pr(X_i = \theta_0 + 1) = \tfrac12, \quad i = 1, 2.$

One possible (smallest) confidence set for $\theta_0$ is

$C(X_1, X_2) = \begin{cases} \tfrac12(X_1 + X_2) & \text{if } X_1 \neq X_2 \\ X_1 - 1 & \text{if } X_1 = X_2. \end{cases}$

Thus 75% of the time $C(X_1, X_2)$ contains $\theta_0$ (in 75% of repeated trials it covers $\theta_0$ — verify this). Yet if $X_1 \neq X_2$, we are certain that the confidence set exactly covers the true value: 100% of the time it is right. Ex post, or conditional on the data, we get the exact value.
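A simulation sketch verifying these coverage statements (the value of $\theta_0$ below is arbitrary, chosen so that the floating-point comparisons are exact):

```python
# Sketch: coverage of the confidence set in Example 1.
# X1, X2 i.i.d. with P(X = theta0 - 1) = P(X = theta0 + 1) = 1/2.
# C = {(X1 + X2)/2} if X1 != X2, and {X1 - 1} if X1 = X2.
import numpy as np

rng = np.random.default_rng(2)
theta0, reps = 2.0, 200_000
x = theta0 + rng.choice([-1.0, 1.0], size=(reps, 2))
differ = x[:, 0] != x[:, 1]
estimate = np.where(differ, x.mean(axis=1), x[:, 0] - 1.0)
covered = estimate == theta0

print(covered.mean())            # unconditional coverage: about 0.75
print(covered[differ].mean())    # conditional on X1 != X2: exactly 1.0
print(covered[~differ].mean())   # conditional on X1 == X2: about 0.5
```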

Example 2. (D.R. Cox) (1) You have data, say on DNA from crime scenes. (2) You can send the data to the New York or the California lab. Both labs seem equally good. (3) Toss a coin to decide which lab analyzes the data. Should the coin flip be accounted for in the design of the test statistic?

Example 3. Test $H_0: \theta = -1$ vs. $H_A: \theta = 1$, with $X \sim N(\theta, 0.25)$. Consider the rejection region: reject if $X \geq 0$. If we observe $X = 0$, we would have $\alpha = .0228$. The size is $.0228$; the power under the alternative is $.9772$. Looks good, but if we reverse the roles of the null and the alternative, it would also look good.

[Figure: the densities $\phi((x-\mu_0)/\sigma)$ and $\phi((x-\mu_A)/\sigma)$ for $\mu_0 = -1$, $\mu_A = 1$, $\sigma^2 = 0.25$, with the rejection region $X \geq 0$.] In the case of $X = 0$ being observed: size $= .0228$, power $= .9772$.
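A quick numerical check of Example 3, assuming its setup ($X \sim N(\theta, 0.25)$, reject $H_0$ if $X \geq 0$): the size, the power, and the likelihood ratio at the boundary observation $X = 0$.

```python
# Sketch: size and power of the rule "reject H0: theta = -1 if X >= 0"
# when X ~ N(theta, 0.25), plus the likelihood ratio at the observed X = 0.
from scipy.stats import norm

sigma = 0.5                                   # standard deviation, since the variance is 0.25
size = norm.sf(0.0, loc=-1.0, scale=sigma)    # P(X >= 0 | theta = -1) = 0.0228
power = norm.sf(0.0, loc=1.0, scale=sigma)    # P(X >= 0 | theta = +1) = 0.9772
lr = norm.pdf(0.0, loc=-1.0, scale=sigma) / norm.pdf(0.0, loc=1.0, scale=sigma)
print(size, power, lr)                        # lr = 1.0: X = 0 favors neither hypothesis
```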

Example 4. $x \in \{1, 2, 3\}$; we have two possible hypotheses, $H_0$ and $H_1$, with

          x = 1    x = 2    x = 3
  H_0     .009     .001     .99
  H_1     .001     .989     .01

Consider the test: accept $H_0$ when $x = 3$ and accept $H_1$ otherwise ($\alpha = .01$ and power $= .99$: high power).

If we observe $x = 1$ we reject $H_0$. But the likelihood ratio in favor of $H_0$ is $.009/.001 = 9$.

Example 5.

          x = 1    x = 2    x = 3
  H_0     .005     .005     .99
  H_1     .0051    .9849    .01

Reject $H_0$ when $x \in \{1, 2\}$: power $= .99$, size $= .01$. Is it reasonable to pick $H_1$ over $H_0$ when $x = 1$ is observed? (The likelihood ratio, $.0051/.005$, does not strongly support the alternative.)

Example 6. (Lindley and Phillips, American Statistician, August 1976.) Consider an experiment: we draw 12 balls from an urn. The urn has an infinite number of balls. Let $\theta$ = probability of black and $(1 - \theta)$ = probability of red.

Trials are independent, and under the hypothesis to be tested red and black are equally likely on each trial. With $N_1$ = number of black balls in 12 draws,

$\Pr(N_1 \text{ are black}) = \binom{12}{N_1}\theta^{N_1}(1-\theta)^{12-N_1}.$

Suppose that we draw 9 black balls and 3 red balls. What is the evidence in support of the hypothesis that $\theta = \tfrac12$?

We might choose the critical region $C_1 = \{9, 10, 11, 12\}$ to reject the null of $\theta = \tfrac12$:

$p = \Pr\!\left(N_1 \in \{9, 10, 11, 12\} \mid \theta = \tfrac12\right) = \left[\binom{12}{3} + \binom{12}{2} + \binom{12}{1} + \binom{12}{0}\right]\left(\tfrac12\right)^{12} = \frac{299}{4096} \approx 7.3\%.$

We do not reject for $\alpha = .05$. This sampling distribution assumes that 10 blacks and 2 reds is a possibility. (It is based on a counterfactual space of what else could occur and with what probability.)

Now consider an alternative sampling rule: draw balls until 3 red balls are observed and then stop. (So 10 blacks and 2 reds on a trial of 12 is not possible, as it was before.) The distribution of $N_2$ (the number of black balls in this experiment) is

$\Pr(N_2 = n) = \binom{n+2}{2}\theta^{n}(1-\theta)^{3}.$

Take the rejection region $C_2 = \{9, 10, 11, 12, 13, \dots\}$, i.e., reject if $N_2 \geq 9$:

$\Pr(N_2 \geq 9 \mid \theta = \tfrac12) \approx 3.3\%.$

Now the result is significant. Reject the null of equality.

Now suppose that in both cases you observe 9 black and 3 red on a single trial.

In computing $p$-values and significance levels, you need to model what didn't occur. The answer depends on the stopping rule and the hypothetical admissible sample space.
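A sketch of the two tail probabilities in Example 6, computed under the two stopping rules at $\theta = \tfrac12$ with 9 blacks and 3 reds observed; scipy's binomial and negative binomial distributions serve as the two sampling models.

```python
# Sketch: the same data (9 black, 3 red) give different p-values under
# different stopping rules, although the likelihood for theta is proportional
# to theta^9 (1 - theta)^3 in both cases.
from scipy.stats import binom, nbinom

theta = 0.5                               # hypothesized probability of black
# Rule 1: fix 12 draws and count blacks; p = P(N1 >= 9 | theta).
p_binomial = binom.sf(8, 12, theta)       # about 0.073: not significant at the 5% level
# Rule 2: draw until the 3rd red; N2 = blacks drawn before stopping.
# scipy's nbinom counts failures (blacks) before the 3rd success (red).
p_negbinom = nbinom.sf(8, 3, 1 - theta)   # about 0.033: significant at the 5% level
print(p_binomial, p_negbinom)
```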

3.3 Additional Problems with Classical Statistics

Classical: design of what could happen. This affects $p$-values. Trials with 3 results $E_1$, $E_2$, $E_3$ (from Hacking). The null $h$ and the alternatives $i$, $j$ assign the following probabilities:

          E_1        E_2     E_3
  h       .01        .95     .04
  i       0          .95     .05
  j       .00001     .95     .04999

Reject $h$ iff $E_1$ occurs. This is not the most powerful test; in fact, it is a biased test (size $= .01$, which is greater than the power of $0$ against $i$).

Power (from Hacking). Two hypotheses over four results:

          E_1     E_2     E_3     E_4
  Null     0      .01     .01     .98
  Alt.    .01     .01     .97     .01

Test 1: reject iff $E_3$ occurs; size $= .01$, power $= .97$. Test 2: reject iff $E_1$ or $E_2$ occurs; size $= .01$, power $= .02$. Test 2 is inferior to Test 1 in power, but if $E_1$ occurs we reject and then we know the null has to be false (the null gives $E_1$ probability zero). Test 2 is a much better ex post test than Test 1.

3.4 Likelihood Principle

All of the information is in the sample. Look at the likelihood as the best summary of the sample.

Likelihood Approach. Recall from our discussion of asymptotics (Part III) that, under the regularity conditions of Theorems 2 and 3, because $\partial \ln \mathcal{L}(\hat\theta)/\partial\theta = 0$ (the first-order condition defining the MLE $\hat\theta$), a second-order expansion around $\hat\theta$ gives

$\ln \mathcal{L}(\theta_0) = \ln \mathcal{L}(\hat\theta) + \frac12(\theta_0 - \hat\theta)'\,\frac{\partial^2 \ln \mathcal{L}(\hat\theta)}{\partial\theta\,\partial\theta'}\,(\theta_0 - \hat\theta) + o_p(1),$

where in the i.i.d. case $\ln \mathcal{L}(\theta) = \sum_{t=1}^{T} \ln f(y_t \mid \theta)$.

In terms of the information matrix $\mathcal{I}(\theta_0)$, for the likelihood case

$\ln \mathcal{L}(\hat\theta) = \ln \mathcal{L}(\theta_0) + \frac12(\hat\theta - \theta_0)'\,T\,\mathcal{I}(\theta_0)\,(\hat\theta - \theta_0) + o_p(1).$

So we know that, as $T \to \infty$, the likelihood $\mathcal{L}$ looks like a normal, e.g.,

$N(\theta \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}}\exp\left\{-\frac12(\theta - \mu)'\Sigma^{-1}(\theta - \mu)\right\},$

$\ln N(\theta \mid \mu, \Sigma) = -\frac{k}{2}\ln(2\pi) - \frac12\ln|\Sigma| - \frac12(\theta - \mu)'\Sigma^{-1}(\theta - \mu).$

So the likelihood converges to a normal-looking criterion and has its mode near $\theta_0$ in large samples. The most likely value is the MLE (the mode of the likelihood), which converges to $\theta_0$.

3.5 Bayesian Principle

Use prior information in conjunction with sample information. Place priors on parameters. The Classical Method and the Likelihood Principle sharply separate parameters from data (random variables). The Bayesian method does not: all parameters are random variables.

The Bayesian and likelihood approaches both use the likelihood. Likelihood: use data from the experiment (evidence concentrates on $\theta_0$). Bayesian: use data from the experiment plus a prior.

The Bayesian approach postulates a prior $g(\theta)$. This is a probability distribution over $\theta$. Compute the posterior using Bayes' theorem:

$\underbrace{g(\theta \mid y)}_{\text{posterior}} = c \cdot \underbrace{L(y \mid \theta)}_{\text{likelihood}} \cdot \underbrace{g(\theta)}_{\text{prior}},$

where $c$ is a constant defined so that the posterior integrates to 1. We get a posterior that is independent of multiplicative constants (and therefore of the sampling rule).

3.6 De Finetti's Theorem

Let $X$ denote a binary variable in $\{0, 1\}$, part of an exchangeable sequence (i.i.d. given $\theta$), with $\Pr(X = 1) = \theta$ and $\Pr(X = 0) = 1 - \theta$. Let $P(r, s)$ be the probability of $r$ ones and $s$ zeros. Then

$P(r, s) = \int_0^1 \binom{r+s}{r}\theta^{r}(1-\theta)^{s}\,d\mu(\theta)$

for some measure $\mu(\cdot) \geq 0$. (This is just the standard Hausdorff moment problem.)

For this problem a natural conjugate prior is the Beta density

$g(\theta) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)}, \qquad 0 \leq \theta \leq 1.$

For $a = b = 1$ this is the uniform prior; its mean is $E(\theta) = \dfrac{a}{a+b}$.

[Figures: Beta$(\theta; a, b)$ probability density functions plotted over $\theta \in [0, 1]$ for $a \in \{0.1, 0.5, 1, 2, 3, 5, 10\}$ and $b \in \{0.1, 0.5, 1, 2, 3, 5, 10\}$.]

Posterior:

$g(\theta \mid y) = c\,\underbrace{\theta^{r}(1-\theta)^{s}}_{\text{likelihood}}\,\underbrace{\theta^{a-1}(1-\theta)^{b-1}}_{\text{prior}},$

where $y$ is the data ($r$ blacks and $s$ reds) and $c$ is a normalizing constant that makes the density integrate to one:

$c\int_0^1 \theta^{r+a-1}(1-\theta)^{s+b-1}\,d\theta = 1.$

Observe, crucially, that the normalizing constant is the same for both sampling rules we discussed in the red-ball and black-ball problem. Why? Because we choose $c$ to make $g(\theta \mid y)$ integrate to one.

Mean of the posterior with the Beta$(a, b)$ prior:

$E(\theta \mid y) = \frac{r + a}{(r + a) + (s + b)}.$

Notice: the combinatorial constants that played such a crucial role in the sampling distribution play no role here; they vanish into the normalizing constant. The mode of the posterior is

$\frac{r + a - 1}{(r + a - 1) + (s + b - 1)}.$

The likelihood corresponds to $(r + s)$ trials with $r$ black and $s$ red. The prior corresponds to $(a + b - 2)$ trials with $(a - 1)$ black and $(b - 1)$ red.
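A sketch of the conjugate update for the urn example under a Beta$(a, b)$ prior (the uniform prior $a = b = 1$ is used for illustration); note that the posterior, and hence its mean and mode, is the same whichever stopping rule produced the 9 blacks and 3 reds.

```python
# Sketch: Beta(a, b) prior with data of r blacks and s reds.
# The posterior is Beta(r + a, s + b) regardless of the stopping rule,
# because the combinatorial constants cancel in the normalization.
from scipy.stats import beta

a, b = 1.0, 1.0                 # uniform prior
r, s = 9, 3                     # 9 black draws, 3 red draws

posterior = beta(r + a, s + b)
post_mean = (r + a) / ((r + a) + (s + b))              # = 10/14
post_mode = (r + a - 1) / ((r + a - 1) + (s + b - 1))  # = 9/12
print(posterior.mean(), post_mean)   # both about 0.714
print(post_mode)                     # 0.75, the MLE under the uniform prior
print(posterior.cdf(0.5))            # posterior probability that theta <= 1/2
```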

Empirical Bayes Approach: estimate the prior. Go to the Beta-Binomial example:

$\Pr(r, s \mid a, b) = \int_0^1 \binom{r+s}{r}\theta^{r}(1-\theta)^{s}\,\frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)}\,d\theta = \binom{r+s}{r}\frac{B(r+a,\, s+b)}{B(a, b)}.$

Now $\theta$ is a heterogeneity parameter distributed Beta$(a, b)$. Estimate $a$ and $b$ as parameters from a string of trials, each with its own count of blacks and reds; $\theta$ is a person-specific parameter.
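A hedged sketch of the empirical Bayes step: with many units, each contributing $r_i$ blacks out of $n$ draws, the hyperparameters $(a, b)$ can be estimated by maximizing the beta-binomial likelihood. The panel below is simulated, and the optimizer settings are illustrative.

```python
# Sketch: empirical Bayes estimation of the Beta(a, b) hyperparameters by
# maximizing the beta-binomial likelihood over a panel of units, each with
# its own theta_i drawn from Beta(a, b). Data here are simulated.
import numpy as np
from scipy.special import betaln, gammaln
from scipy.optimize import minimize

rng = np.random.default_rng(3)
a_true, b_true, n, units = 2.0, 5.0, 20, 500
theta_i = rng.beta(a_true, b_true, size=units)   # person-specific probabilities
r_i = rng.binomial(n, theta_i)                   # black counts for each unit

def neg_loglik(params):
    a, b = np.exp(params)                        # reparameterize so that a, b > 0
    # log P(r_i | a, b) = log C(n, r_i) + log B(r_i + a, n - r_i + b) - log B(a, b)
    ll = (gammaln(n + 1) - gammaln(r_i + 1) - gammaln(n - r_i + 1)
          + betaln(r_i + a, n - r_i + b) - betaln(a, b))
    return -ll.sum()

res = minimize(neg_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
print(np.exp(res.x))   # estimates of (a, b); close to (2, 5) for large panels
```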

A similar idea applies in the linear regression model $Y = X\beta + U$. (Use the notes from the second day of class for the proof of identification; we can identify means and variances.) Let the coefficient be heterogeneous: $\beta_i = \bar\beta + \varepsilon_i$ with $E(\varepsilon_i) = 0$. Assume $\varepsilon_i \perp\!\!\!\perp (X_i, U_i)$. Then

$Y_i = X_i\bar\beta + (X_i\varepsilon_i + U_i), \qquad \mathrm{Var}(Y_i \mid X_i) = X_i\,\Sigma_\varepsilon\,X_i' + \sigma_U^2.$

Use the squared OLS residuals to identify $\Sigma_\varepsilon$ given $\sigma_U^2$.

Notice: we can extend the model to allow $\bar\beta = Z\gamma + v$ and identify $\gamma$ (a hierarchical model).

Digression: take the classical normal linear regression model

$Y = X\beta + U, \qquad E(UU') = \sigma^2 I, \qquad \hat\beta = (X'X)^{-1}X'Y, \qquad \mathrm{Var}(\hat\beta) = \sigma^2 (X'X)^{-1}.$

Assume $\sigma^2$ is known. Take a prior on $\beta$: $\beta \sim N\!\left(\beta_0,\ \sigma^2 (X_0'X_0)^{-1}\right)$, written as if it came from a prior sample $(Y_0, X_0)$. The posterior is normal:

$\beta \mid Y \sim N\!\left((X_0'X_0 + X'X)^{-1}\left(X_0'X_0\,\beta_0 + X'X\,\hat\beta\right),\ \sigma^2 (X_0'X_0 + X'X)^{-1}\right).$

Thus we can think of the prior as a sample of observations, with the $(X'X)$ matrix for that sample being $X_0'X_0$ and the OLS estimate from the prior sample being $\beta_0$. Compare to stacking $Y_0 = X_0\beta + U_0$ with $Y = X\beta + U$: OLS on the stacked data is

$(X_0'X_0 + X'X)^{-1}(X_0'X_0\,\beta_0 + X'X\,\hat\beta), \qquad \beta_0 = (X_0'X_0)^{-1}X_0'Y_0, \quad \hat\beta = (X'X)^{-1}X'Y.$

(Prove this.) See Leamer for the more general case where $\sigma^2$ is unknown (gamma prior).
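A sketch verifying the "prior as a pseudo-sample" interpretation: OLS on the stacked data $(X_0, Y_0), (X, Y)$ reproduces the matrix-weighted combination of the two subsample OLS estimates. The simulated design below is illustrative.

```python
# Sketch: pooled OLS on stacked samples equals the matrix-weighted average
# (X0'X0 + X'X)^{-1} (X0'X0 b0 + X'X b), which is the posterior-mean formula
# with the prior treated as a pseudo-sample (X0, Y0).
import numpy as np

rng = np.random.default_rng(4)
k, T0, T = 2, 30, 200
beta = np.array([1.0, -0.5])
X0 = rng.normal(size=(T0, k))                # "prior" pseudo-sample design
X = rng.normal(size=(T, k))                  # data design
Y0 = X0 @ beta + rng.normal(size=T0)
Y = X @ beta + rng.normal(size=T)

b0 = np.linalg.solve(X0.T @ X0, X0.T @ Y0)   # OLS on the prior sample
b = np.linalg.solve(X.T @ X, X.T @ Y)        # OLS on the data sample
combined = np.linalg.solve(X0.T @ X0 + X.T @ X,
                           X0.T @ X0 @ b0 + X.T @ X @ b)
Xs, Ys = np.vstack([X0, X]), np.concatenate([Y0, Y])
stacked = np.linalg.solve(Xs.T @ Xs, Xs.T @ Ys)
print(np.allclose(combined, stacked))        # True: pooled OLS = weighted average
```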

To compute evidence on one hypothesis vs. another hypothesis, use the posterior odds ratio:

$\frac{\Pr(H_1 \mid y)}{\Pr(H_0 \mid y)} = \frac{\Pr(y \mid H_1)}{\Pr(y \mid H_0)} \cdot \frac{\Pr(H_1)}{\Pr(H_0)}.$

Hypotheses are restrictions on the prior (e.g., different choices of $g(\theta)$), e.g., a uniform prior on $\theta$ vs. a prior concentrated on a single value.

Classical Approach vs. Bayesian Approach

Assumption regarding experiment:
  Classical: events independent, given a probability.
  Bayesian: events form exchangeable sequences.

Interpretation of probability:
  Classical: relative frequency; applies only to repeated events.
  Bayesian: degrees of belief; applies both to unique events and to sequences of events.

Statistical inferences:
  Classical: based on the sampling distribution; the sample space or stopping rule must be specified.
  Bayesian: based on the posterior distribution; the prior distribution must be assessed.

Estimates of parameters:
  Classical: requires a theory of estimation.
  Bayesian: descriptive statistics of the posterior distribution.

Intuitive judgement:
  Classical: used in setting significance levels, in choice of procedure, and in other ways.
  Bayesian: formally incorporated in the prior distribution.

Source: Lindley, D.V. and Phillips, L.D. (1976). "Inference for a Bernoulli Process (A Bayesian View)." American Statistician 30(3): 112-119.

4 Bayesian Testing

Point null vs. point alternative test (Leamer example). Think of a regression model:

$H_1: Y = X\beta_1 + U_1 \quad \text{vs.} \quad H_0: Y = X\beta_0 + U_0,$

where the hypotheses fix $\beta$ at $\beta_1$ or $\beta_0$. Posterior odds ratio:

$\underbrace{\frac{\Pr(H_1 \mid Y)}{\Pr(H_0 \mid Y)}}_{\text{posterior odds ratio}} = \underbrace{\frac{\Pr(Y \mid H_1)}{\Pr(Y \mid H_0)}}_{\text{Bayes factor}} \times \underbrace{\frac{\Pr(H_1)}{\Pr(H_0)}}_{\text{prior odds ratio}}.$

Predictive density:

$\Pr(Y \mid H_i) = \int\!\!\int \underbrace{f(Y \mid \beta, \sigma^2)}_{\text{likelihood}}\ \underbrace{g(\beta, \sigma^2 \mid H_i)}_{\text{prior density}}\ d\beta\, d\sigma^2.$

The evidence supports the model with the higher posterior probability.

Example: $Y_t \sim N(\mu, \sigma^2)$, $t = 1, \dots, T$, with $\sigma^2$ known.

$H_0: \mu_0 = 0,\ \sigma = 1; \qquad H_1: \mu_1 = 1,\ \sigma = 1.$

Thus $\bar Y \sim N(0, 1/T)$ under $H_0$ and $\bar Y \sim N(1, 1/T)$ under $H_1$.

Typical Neyman-Pearson rule: reject $H_0$ if $\bar Y > c$; accept $H_0$ if $\bar Y \leq c$. Type 1 and Type 2 errors:

$\alpha(c) = \Pr(\bar Y > c \mid \mu = 0), \qquad \beta(c) = \Pr(\bar Y \leq c \mid \mu = 1).$

Example: $c = 0.5$; with $T = 1$, $\alpha = \beta \approx .31$.

Bayes approach:

$\Pr(H_0 \mid \bar Y) = \frac{f_0(\bar Y)\Pr(H_0)}{f_0(\bar Y)\Pr(H_0) + f_1(\bar Y)\Pr(H_1)}, \qquad \Pr(H_1 \mid \bar Y) = \frac{f_1(\bar Y)\Pr(H_1)}{f_0(\bar Y)\Pr(H_0) + f_1(\bar Y)\Pr(H_1)},$

where $f_i$ is the density of $\bar Y$ under $H_i$.

$\frac{\Pr(H_0 \mid \bar Y)}{\Pr(H_1 \mid \bar Y)} = \frac{f_0(\bar Y)\Pr(H_0)}{f_1(\bar Y)\Pr(H_1)} = \exp\left\{-\frac{T}{2}\left[\bar Y^2 - (\bar Y - 1)^2\right]\right\}\frac{\Pr(H_0)}{\Pr(H_1)} = \exp\left\{-\frac{T}{2}\left[2\bar Y - 1\right]\right\}\frac{\Pr(H_0)}{\Pr(H_1)} = \exp\left\{T\left(\tfrac12 - \bar Y\right)\right\}\frac{\Pr(H_0)}{\Pr(H_1)}.$

(Recall $\sigma^2 = 1$ under both the null and the alternative.)

Taking logs, accept $H_0$ if

$\ln\frac{\Pr(H_0 \mid \bar Y)}{\Pr(H_1 \mid \bar Y)} = T\left(\tfrac12 - \bar Y\right) + \ln\frac{\Pr(H_0)}{\Pr(H_1)} \geq 0 \quad\Longleftrightarrow\quad \bar Y \leq \frac12 + \frac{1}{T}\ln\frac{\Pr(H_0)}{\Pr(H_1)}.$

(If this holds, accept $H_0$.) As $T$ gets big, the cutoff changes with the sample size unless $\Pr(H_0) = \Pr(H_1) = \tfrac12$. Notice that this is different from the classical statistical rule of a fixed cutoff point.
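A sketch contrasting the two cutoffs for $\bar Y$ as $T$ grows, assuming the setup above ($H_0: \mu = 0$ vs. $H_1: \mu = 1$, $\sigma = 1$): the Bayes cutoff $\tfrac12 + \tfrac1T\ln[\Pr(H_0)/\Pr(H_1)]$ drifts toward $\tfrac12$, while the classical cutoff for a fixed $\alpha$ shrinks toward the null. The prior probabilities and $\alpha$ below are illustrative.

```python
# Sketch: Bayes cutoff for accepting H0: mu = 0 against H1: mu = 1
# (Ybar <= 1/2 + (1/T) ln(P(H0)/P(H1))), compared with the classical
# cutoff c(alpha) = Phi^{-1}(1 - alpha)/sqrt(T) for fixed alpha.
import numpy as np
from scipy.stats import norm

p0, p1, alpha = 0.8, 0.2, 0.05                        # illustrative priors and test size
for T in (1, 10, 100, 1000):
    bayes_cut = 0.5 + np.log(p0 / p1) / T             # tends to 1/2 as T grows
    classical_cut = norm.ppf(1 - alpha) / np.sqrt(T)  # tends to 0 as T grows
    print(T, round(bayes_cut, 4), round(classical_cut, 4))
```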

Compare this to the fixed rule in Example 3. This is a Bayesian version of Example 3 with $H_0: \mu = 0$ vs. $H_A: \mu = 1$, instead of $H_0: \mu = -1$ vs. $H_A: \mu = 1$. Note that we get the same kind of rule in both cases when $\Pr(H_0) = \Pr(H_1) = \tfrac12$.

Point Null vs. Composite Alternative

Same setup as in the previous case: $Y_t \sim N(\mu, \sigma^2)$. $H_0: \mu = 0$ vs. $H_A: \mu \neq 0$. $\sigma^2$ is unspecified but common across models. Turn the Bayes crank. Likelihood factor: $f(Y; \mu, \sigma^2)$ vs. $f(Y; 0, \sigma^2)$. Relative likelihood:

$L(\mu) = \exp\left\{\frac{T}{2\sigma^2}\left[\bar Y^2 - (\bar Y - \mu)^2\right]\right\}.$

What value of $\mu$ is best supported by the data? Recall the likelihood approach (it focuses on the parameter value that makes the observed outcome most likely): the maximum is at $\mu = \bar Y$, where

$L(\bar Y) = \exp\left\{\frac{T\bar Y^2}{2\sigma^2}\right\}.$

[Figure: the relative likelihood $\exp\{[\bar Y^2 - (\bar Y - \mu)^2]\,T/(2\sigma^2)\}$ plotted against $\mu$, for $\bar Y = 1$, $\sigma = 1$, $T = 1$.]

The $p$-value approach uses absolute likelihood, not relative likelihood. In what sense is the observed outcome "most likely"? Likelihood approach: evaluate the relative likelihood at the null $\mu = 0$ and we get

$L(0)/L(\bar Y) = \exp\left\{-\frac{T\bar Y^2}{2\sigma^2}\right\} = \exp\Big\{-\tfrac12 \underbrace{\big(\bar Y / \sqrt{\sigma^2/T}\big)^2}_{t^2 \text{ for } \mu = 0}\Big\}.$

This is an expression of support for the hypothesis $\mu = 0$: a big $t$ value leads to rejection of the null. But this approach does not worry about the alternative.

Frequency theory or sampling approach: look at the sampling distribution of the model's test statistic $t$, centered at $\mu = 0$,

$\alpha(c) = \Pr(|t| > c \mid \mu = 0),$

e.g., if $|t| > 1.96$ we reject. The $p$-value is the knife-edge $\alpha$: the significance level at which the observed value just leads to rejection. At any level less than the $p$-value, the null hypothesis is not rejected.


Significance level: is what occurred unlikely? Relative likelihood is a relative idea (null vs. alternative): support for one hypothesis vs. support for another.

Bayes approach: allocate positive probability to the null (otherwise the probability of a point null is 0). Prior:

$g(\mu) = \begin{cases} \rho & \text{if } \mu = 0 \ \text{(point mass)} \\ (1 - \rho)\,g_1(\mu) & \text{if } \mu \neq 0, \end{cases}$

where $g_1$ is a $N(0, \sigma_\mu^2)$ density.

(Point mass at the null.) The posterior odds are

$\frac{\Pr(H_1 \mid \bar Y)}{\Pr(H_0 \mid \bar Y)} = \frac{(1 - \rho)\displaystyle\int_{\mu \neq 0} \frac{1}{\sqrt{2\pi\sigma^2/T}}\exp\left\{-\frac{T(\bar Y - \mu)^2}{2\sigma^2}\right\}\frac{1}{\sqrt{2\pi\sigma_\mu^2}}\exp\left\{-\frac{\mu^2}{2\sigma_\mu^2}\right\}d\mu}{\rho\,\dfrac{1}{\sqrt{2\pi\sigma^2/T}}\exp\left\{-\dfrac{T\bar Y^2}{2\sigma^2}\right\}}.$

Complete the square in the numerator and integrate out $\mu$.

Side manipulations: look at the exponent in the numerator,

$-\frac{T(\bar Y^2 - 2\bar Y\mu + \mu^2)}{2\sigma^2} - \frac{\mu^2}{2\sigma_\mu^2}.$

Complete the square to reach

$-\frac{T(\bar Y - \mu)^2}{2\sigma^2} - \frac{\mu^2}{2\sigma_\mu^2} = -\frac12\left(\frac{T}{\sigma^2} + \frac{1}{\sigma_\mu^2}\right)\left(\mu - \frac{T\bar Y/\sigma^2}{T/\sigma^2 + 1/\sigma_\mu^2}\right)^2 - \frac{T\bar Y^2}{2\sigma^2}\left(1 - \frac{T/\sigma^2}{T/\sigma^2 + 1/\sigma_\mu^2}\right).$

Then integrate out $\mu$ and we get (cancelling terms)

$\frac{\Pr(H_1 \mid \bar Y)}{\Pr(H_0 \mid \bar Y)} = \frac{1 - \rho}{\rho}\ \underbrace{\left(\frac{1}{1 + T\sigma_\mu^2/\sigma^2}\right)^{1/2}\exp\left[\frac{t^2}{2}\left(1 - \frac{1}{1 + T\sigma_\mu^2/\sigma^2}\right)\right]}_{\text{Bayes factor}},$

where $t = \bar Y/\sqrt{\sigma^2/T}$. Notice that the higher $t$ is, the more likely we are to reject $H_0$. However, as $T \to \infty$ for fixed $t$, we get $\dfrac{\Pr(H_1 \mid \bar Y)}{\Pr(H_0 \mid \bar Y)} \to 0$. Notice that $t$ is $O_p(1)$ for $\mu = 0$.

We support $H_0$ (the "Lindley Paradox"). Bayesians use the sample size to adjust the critical or rejection region. In the classical case, with $\alpha$ fixed, the power of the test goes to 1. (It overweights the null hypothesis.) Issue: which weighting of $\alpha$ and $\beta$ is better?
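A closing sketch of the formula above: hold the $t$-statistic fixed (say $t = 1.96$, "just significant at 5%") and let $T$ grow; the posterior odds in favor of $H_1$ fall toward zero. The choices $\rho = \tfrac12$ and $\sigma_\mu = \sigma = 1$ are illustrative.

```python
# Sketch: Lindley paradox. Posterior odds Pr(H1|data)/Pr(H0|data) for a fixed
# t-statistic as T grows, using
#   odds = ((1 - rho)/rho) * (1 + T*s2)^(-1/2) * exp((t^2/2) * (1 - 1/(1 + T*s2)))
# with s2 = sigma_mu^2 / sigma^2.
import numpy as np

rho, s2, t = 0.5, 1.0, 1.96    # equal prior odds; fixed "just significant" t-value
for T in (10, 100, 1_000, 100_000):
    bf = (1 + T * s2) ** -0.5 * np.exp(0.5 * t**2 * (1 - 1 / (1 + T * s2)))
    odds = (1 - rho) / rho * bf
    print(T, odds)             # the posterior odds for H1 fall toward 0 as T grows
```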