What are the Findings?

James B. Rawlings
Department of Chemical and Biological Engineering
University of Wisconsin-Madison
Madison, Wisconsin
April 2010

Why look at this problem?

- Hypothesis testing (is it a fair coin?)
- Confidence intervals (assign probability to the estimate)
- Quantifying the information gained from a measurement
- Conditional probability
- Bayesian estimation
- Intuition requires experience, and this problem provides experience
- It's a fun problem!

Is it a fair coin?

- I am given a coin.
- I want to know if it is a fair coin.
- So I flip it 1000 times and observe 527 heads.
- What can I conclude?

Standard model of a coin

- The coin is characterized by a parameter value, θ.
- The probability of flipping heads is θ and of tails is 1 − θ.
- The fair coin has parameter value θ = 1/2.
- The parameter is a feature of the coin and does not change with time.
- All flips of the coin are independent events, i.e., they are uninfluenced by the outcomes of other flips.

How do we estimate θ from the experiment?

Say the coin has a fixed, but unknown, parameter θ_0.

Given the model, we can compute the probability of the observation. Let n be the number of heads. The probability of n heads, each with probability θ_0, and N − n tails, each with probability 1 − θ_0, is

    p(n) = \binom{N}{n} \theta_0^n (1 - \theta_0)^{N - n} = B(n, N, \theta_0)

and the \binom{N}{n} accounts for the number of ways one can obtain n heads and N − n tails.

This is the famous binomial distribution.
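
As a quick numerical aside (a minimal Python sketch using scipy, not part of the original slides), the binomial probability of the observed outcome under a fair coin can be evaluated directly:

```python
from scipy.stats import binom

N = 1000      # number of coin flips
n = 527       # observed number of heads
theta0 = 0.5  # hypothesized fair-coin parameter

# Probability of observing exactly n heads out of N flips: B(n, N, theta0)
p_n = binom.pmf(n, N, theta0)
print(p_n)    # about 0.006
```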

The likelihood function

We define the likelihood of the data, L(n; θ), as this same function p(n), now regarded as valid for any value of θ:

    L(n; \theta) = \binom{N}{n} \theta^n (1 - \theta)^{N - n}

We note that the likelihood depends on both the parameter θ and the observation n.

Likelihood function L(n; θ = 0.5) for this experiment

[Figure: L(n; θ = 0.5) plotted as a function of n for n between 0 and 1000.]

Notice that L(n; θ = 0.5) is a probability density in n (the sum over n is one).

Likelihood function L(n = 527; θ) for this experiment

[Figure: L(n = 527; θ) plotted as a function of θ for θ between 0 and 1.]

Notice that L(n = 527; θ) is not a probability density in θ (its area is not one).
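
Both normalization statements are easy to check numerically. A minimal sketch (Python with numpy/scipy; my own illustration, with the grid resolution chosen arbitrarily):

```python
import numpy as np
from scipy.stats import binom

N, n_obs = 1000, 527

# Sum of L(n; theta = 0.5) over all n: a probability density in n.
n = np.arange(N + 1)
print(binom.pmf(n, N, 0.5).sum())   # 1.0

# Average of L(n = 527; theta) over theta in [0, 1] equals its integral
# (the interval has length one): far from 1, in fact about 1/(N+1).
theta = np.linspace(0.0, 1.0, 100001)
print(binom.pmf(n_obs, N, theta).mean())   # about 0.000999
```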

Maximum likelihood estimation

A sensible parameter estimate, then, is the value of θ that maximizes the likelihood of the observation n:

    \hat{\theta}(n) = \arg\max_{\theta} L(n; \theta)

For this problem, take the derivative of L with respect to θ, set it to zero, and find

    \hat{\theta} = n / N

After observing 527 heads out of 1000 flips, we conclude ˆθ = 0.527.

What could be simpler?

But the question remains: can we conclude the coin is unfair? How?
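
A small numerical check of this result (my own sketch; a coarse grid search rather than the analytical derivative used on the slide):

```python
import numpy as np
from scipy.stats import binom

N, n = 1000, 527

# Grid search over theta for the maximizer of L(n; theta).
theta = np.linspace(0.0, 1.0, 1001)
L = binom.pmf(n, N, theta)
print(theta[np.argmax(L)], n / N)   # both 0.527
```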

Now the controversy

We want to draw a statistically valid conclusion about whether the coin is fair.

This question is the same one that the social scientists are asking about whether the data support the existence of ESP.

We could pose the question in the form of a yes/no hypothesis: is the coin fair?

Hypothesis testing

"Significance testing in general has been a greatly overworked procedure, and in many cases where significance statements have been made it would have been better to provide an interval within which the value of the parameter would be expected to lie."

Box, Hunter, and Hunter (1978, p. 109)

Constructing confidence intervals

So let's instead pursue finding the confidence intervals.

We have an estimator

    \hat{\theta} = n / N

Notice that ˆθ is a random variable. Why? (What is not a random variable in this problem?)

We know the probability density of n, so let's compute the probability density of ˆθ:

    p_n(n) = \binom{N}{n} \theta_0^n (1 - \theta_0)^{N - n}

    p_{\hat{\theta}}(\hat{\theta}) = \binom{N}{N\hat{\theta}} \theta_0^{N\hat{\theta}} (1 - \theta_0)^{N(1 - \hat{\theta})}

Defining the confidence interval

Define a new random variable z = ˆθ − θ_0.

We would like to find a positive scalar a > 0 such that

    \Pr(-a \le z \le a) = \alpha

in which 0 < α < 1 is the confidence level.

The definition implies that there is α-level probability that the true parameter θ_0 lies in the (symmetric) confidence interval [ˆθ − a, ˆθ + a], or

    \Pr(\hat{\theta} - a \le \theta_0 \le \hat{\theta} + a) = \alpha

What's the rub?

The problem with this approach is that the density of z depends on θ_0:

    p_{\hat{\theta}}(\hat{\theta}) = \binom{N}{N\hat{\theta}} \theta_0^{N\hat{\theta}} (1 - \theta_0)^{N(1 - \hat{\theta})}

    p_z(z) = \binom{N}{N(z + \theta_0)} \theta_0^{N(z + \theta_0)} (1 - \theta_0)^{N(1 - z - \theta_0)}

I cannot find a > 0 such that

    \Pr(-a \le z \le a) = \alpha

unless I know θ_0!

The effect of θ_0 on the confidence interval

[Figure: the density p_z(z) for θ_0 = 0.01 and θ_0 = 0.5, plotted for z between −0.04 and 0.04.]

The effect of θ_0 on the confidence interval

[Figure: the density p_z(z) for θ_0 = 0.5, plotted for z between −0.04 and 0.04.]

    \Pr(\hat{\theta} - 0.031 \le \theta_0 \le \hat{\theta} + 0.031) = 0.95

The effect of θ_0 on the confidence interval

[Figure: the density p_z(z) for θ_0 = 0.01, plotted for z between −0.04 and 0.04.]

    \Pr(\hat{\theta} - 0.005 \le \theta_0 \le \hat{\theta} + 0.008) = 0.95
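
To make the dependence on θ_0 concrete, here is a rough sketch (my own Python illustration, not from the slides) that searches, in steps of 1/N, for the smallest symmetric half-width a with Pr(−a ≤ ˆθ − θ_0 ≤ a) ≥ 0.95. The slides quote an asymmetric interval for θ_0 = 0.01; the symmetric half-width computed here is only meant to show how strongly the interval depends on θ_0:

```python
import numpy as np
from scipy.stats import binom

N, alpha = 1000, 0.95

def symmetric_halfwidth(theta0):
    """Smallest a (in steps of 1/N) with Pr(-a <= n/N - theta0 <= a) >= alpha,
    where n ~ Binomial(N, theta0). Small epsilons guard against float fuzz."""
    for k in range(N + 1):
        a = k / N
        lo = int(np.ceil(N * (theta0 - a) - 1e-9))
        hi = int(np.floor(N * (theta0 + a) + 1e-9))
        prob = binom.cdf(hi, N, theta0) - binom.cdf(lo - 1, N, theta0)
        if prob >= alpha:
            return a

print(symmetric_halfwidth(0.5))    # 0.031, matching the slide
print(symmetric_halfwidth(0.01))   # about 0.006: a very different interval
```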

OK, so now what do we do? Change the experiment

Imagine instead that I draw a random θ from a uniform distribution on the interval [0, 1].

Then I collect a sample of 1000 coin flips with this coin having value θ.

What can I conclude from this experiment?

Note that both θ and n are now random variables, and they are not independent.

There is no true parameter θ_0 in this problem.

Conditional density

Consider two random variables A and B. The conditional density of A given B, denoted p_{A|B}(a|b), is defined as

    p_{A|B}(a|b) = \frac{p_{A,B}(a, b)}{p_B(b)}, \qquad p_B(b) \neq 0

Conditional to Bayes

    p_{A|B}(a|b) = \frac{p_{A,B}(a, b)}{p_B(b)} = \frac{p_{B,A}(b, a)}{p_B(b)} = \frac{p_{B|A}(b|a)\, p_A(a)}{p_B(b)}

    p_{A|B}(a|b) = \frac{p_{B|A}(b|a)\, p_A(a)}{\int p_{B|A}(b|a)\, p_A(a)\, da}

According to Papoulis (1984, p. 30), main idea by Thomas Bayes in 1763, but final form given by Laplace several years later.
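
As a sanity check on the formula (my own toy example with an arbitrary discrete joint distribution, not from the slides):

```python
import numpy as np

# An arbitrary discrete joint density p_{A,B}(a, b); rows index a, columns index b.
p_AB = np.array([[0.10, 0.15],
                 [0.25, 0.05],
                 [0.20, 0.25]])

p_A = p_AB.sum(axis=1, keepdims=True)   # marginal of A
p_B = p_AB.sum(axis=0, keepdims=True)   # marginal of B

p_A_given_B = p_AB / p_B                # definition: p(a|b) = p(a,b) / p(b)
p_B_given_A = p_AB / p_A                # p(b|a) = p(a,b) / p(a)

# Bayes' formula: p(a|b) = p(b|a) p(a) / sum_a' p(b|a') p(a')
numer = p_B_given_A * p_A
print(np.allclose(p_A_given_B, numer / numer.sum(axis=0, keepdims=True)))  # True
```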

The densities of (θ, n)

The joint density:

    p_{\theta, n}(\theta, n) =
      \begin{cases}
        \binom{N}{n} \theta^n (1 - \theta)^{N - n}, & \theta \in [0, 1],\ n \in [0, N] \\
        0, & \theta \notin [0, 1] \text{ or } n \notin [0, N]
      \end{cases}

The marginal densities:

    p_\theta(\theta) = \sum_{n \in [0, N]} p(\theta, n) = 1

    p_n(n) = \int_0^1 p(\theta, n)\, d\theta = \frac{1}{N + 1}

Bayesian posterior

Computing the conditional density gives

    p(\theta \mid n) = (N + 1) \binom{N}{n} \theta^n (1 - \theta)^{N - n}
                     = (N + 1)\, B(n, N, \theta)
                     = \beta(\theta, n + 1, N - n + 1)

The Bayesian posterior is the famous beta distribution.

Maximizing the posterior over θ gives the Bayesian estimate

    \hat{\theta} = \arg\max_{\theta}\, p(\theta \mid n) = \frac{n}{N}

which agrees with the maximum likelihood estimate!
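
In code, the posterior is just a beta distribution, so standard library routines apply directly (a minimal sketch with scipy; my own illustration):

```python
import numpy as np
from scipy.stats import beta

N, n = 1000, 527

# Posterior for theta after observing n heads, with a uniform prior: Beta(n+1, N-n+1)
posterior = beta(n + 1, N - n + 1)

# The posterior mode (the estimate above) should equal n/N.
theta = np.linspace(0.0, 1.0, 1001)
print(theta[np.argmax(posterior.pdf(theta))], n / N)   # both 0.527
```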

Conditional density p(θ | n = 527) for this experiment

[Figure: p(θ | n = 527) plotted as a function of θ for θ between 0 and 1.]

Notice that, unlike L(n = 527; θ), p(θ | n = 527) is a probability density in θ (its area is one).

Confidence intervals from Bayesian posterior

Computing confidence intervals is unambiguous. Find [a, b] such that

    \int_a^b p(\theta \mid n)\, d\theta = \alpha

and there is α-level probability that the random variable θ ∈ [a, b] after observation n.
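
For a symmetric interval centered at the estimate ˆθ = 0.527, the half-width at a given confidence level can be found with a one-dimensional root solve (my own sketch using scipy; the function name is mine):

```python
from scipy.optimize import brentq
from scipy.stats import beta

N, n = 1000, 527
posterior = beta(n + 1, N - n + 1)
theta_hat = n / N

def symmetric_interval(alpha):
    """[theta_hat - a, theta_hat + a] carrying posterior probability alpha."""
    mass = lambda a: posterior.cdf(theta_hat + a) - posterior.cdf(theta_hat - a) - alpha
    a = brentq(mass, 0.0, 0.5)
    return theta_hat - a, theta_hat + a

print(symmetric_interval(0.90))   # about (0.501, 0.553): excludes theta = 1/2
print(symmetric_interval(0.95))   # about (0.496, 0.558): includes theta = 1/2
```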

Closer look at the conditional density

[Figure: close-up of p(θ | n = 527) for θ between 0.48 and 0.60, marking the interval that carries α = 0.913 of the posterior probability.]

So is the coin fair?

The Bayesian conclusion is that the 90% symmetric confidence interval centered at θ = 0.527 does not contain θ = 1/2.

Therefore I conclude the coin is unfair at the 90% confidence level.

But the α = 91.3% confidence interval does include θ = 1/2. I cannot conclude the coin is unfair with greater than 91.3% confidence.
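
The 91.3% break-even level quoted above can be reproduced directly from the posterior: the largest symmetric interval centered at ˆθ that still excludes 1/2 has half-width ˆθ − 1/2, and its posterior probability is (my own check, not from the slides):

```python
from scipy.stats import beta

N, n = 1000, 527
posterior = beta(n + 1, N - n + 1)
theta_hat = n / N

# Symmetric interval centered at theta_hat that just reaches theta = 1/2.
a = theta_hat - 0.5
print(posterior.cdf(theta_hat + a) - posterior.cdf(theta_hat - a))   # about 0.913
```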

Back to the NY Times

"Is this significant evidence that the coin is weighted? Classical analysis says yes. With a fair coin, the chances of getting 527 or more heads in 1,000 flips is less than 1 in 20, or 5 percent. To put it another way: the experiment finds evidence of a weighted coin with 95 percent confidence."

What? That better not be classical analysis. For the binomial, it is true that

    \Pr(n \ge 527) = 1 - F(n = 526, \theta = 0.5) = 0.0468

But so what? We don't have n ≥ 527; we have n = 527.
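
The 0.0468 tail probability is a one-line check (my own sketch with scipy, not part of the slides):

```python
from scipy.stats import binom

# Pr(n >= 527) under a fair coin: 1 - F(n = 526; N = 1000, theta = 0.5)
print(1 - binom.cdf(526, 1000, 0.5))   # about 0.0468
```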

Back to the NY Times

"Yet many statisticians do not buy it. ... It is thus more accurate, these experts say, to calculate the probability of getting that one number 527 if the coin is weighted, and compare it with the probability of getting the same number if the coin is fair. Statisticians can show that this ratio cannot be higher than about 4 to 1 ..."

Again, so what?

    r = \max_{\theta} \frac{B(527, 1000, \theta)}{B(527, 1000, 0.5)}
      = \frac{B(527, 1000, 0.527)}{B(527, 1000, 0.5)} = 4.30
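
The 4.30 ratio can likewise be reproduced numerically (my own check; the maximizing θ is the maximum likelihood estimate 0.527):

```python
from scipy.stats import binom

N, n = 1000, 527

# Likelihood of n = 527 under the best-fitting coin versus under a fair coin.
r = binom.pmf(n, N, n / N) / binom.pmf(n, N, 0.5)
print(r)   # about 4.30
```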

Back to the NY Times

"The point here, said Dr. Rouder, is that 4-to-1 odds just aren't that convincing; it's not strong evidence. And yet classical significance testing has been saying for at least 80 years that this is strong evidence, Dr. Speckman said in an e-mail."

Four-to-one odds means two possible random outcomes have probabilities 0.2 and 0.8. What in this problem has probability 0.2?

The quantity

    \frac{B(527, 1000, 0.5)}{B(527, 1000, 0.527)} = 0.233

is not the probability of a random event. Where is this statement about 4-to-1 odds coming from?

What did we learn?

- Maximizing the likelihood of the observation gives a sensible estimator.
- It may be problematic to compute confidence intervals when using maximum likelihood.
- Conditional probability quantifies information.
- Bayes' theorem is (almost) a definition of conditional probability.
- If the parameter is also considered random, it is easy to construct the posterior distribution and confidence intervals.
- The controversy here seems to be that some consider θ_0 a fixed parameter, but cannot then find confidence intervals that are independent of this unknown parameter.

Further reading

Thomas Bayes. An essay towards solving a problem in the doctrine of chances. Phil. Trans. Roy. Soc., 53:370–418, 1763. Reprinted in Biometrika, 35:293–315, 1958.

George E. P. Box, William G. Hunter, and J. Stuart Hunter. Statistics for Experimenters. John Wiley & Sons, New York, 1978.

Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, Inc., second edition, 1984.

Questions or comments?

Study question

Consider the classic maximum likelihood problem for the linear model

    y = X \theta_0 + e

in which the vector y is measured, the parameter θ_0 is unknown and to be estimated, e is normally distributed measurement error, and X is a constant matrix. When e ~ N(0, σ²I) and we don't know σ, we obtain the following distribution for the maximum likelihood estimate:

    \hat{\theta} \sim N(\theta_0, \sigma^2 (X^T X)^{-1})

This density contains two unknown parameters, θ_0 and σ. But we can still obtain confidence intervals in this case. What's the difference?