What are the Findings?

James B. Rawlings
Department of Chemical and Biological Engineering
University of Wisconsin–Madison
Madison, Wisconsin

April 2010

Rawlings (Wisconsin) Stating the findings 1 / 33
Why look at this problem?

- Hypothesis testing (is it a fair coin?)
- Confidence intervals (assigning a probability to an estimate)
- Quantifying the information gained from a measurement
- Conditional probability
- Bayesian estimation
- Intuition requires experience, and this problem provides experience.
- It's a fun problem!
Is it a fair coin?

- I am given a coin.
- I want to know whether it is a fair coin.
- So I flip it 1000 times and observe 527 heads.
- What can I conclude?
Standard model of a coin

- The coin is characterized by a parameter value, θ.
- The probability of flipping heads is θ, and the probability of tails is 1 - θ.
- The fair coin has parameter value θ = 1/2.
- The parameter is a feature of the coin and does not change with time.
- All flips of the coin are independent events, i.e., they are uninfluenced by the outcomes of other flips.
How do we estimate θ from the experiment?

Say the coin has a fixed but unknown parameter θ_0. Given the model, we can compute the probability of the observation. Let n be the number of heads. The probability of n heads, each with probability θ_0, and N - n tails, each with probability 1 - θ_0, is

p(n) = \binom{N}{n} \theta_0^n (1 - \theta_0)^{N-n} = B(n, N, \theta_0)

and the \binom{N}{n} accounts for the number of ways one can obtain n heads and N - n tails. This is the famous binomial distribution.
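Since the slides contain no code, here is a minimal Python sketch (the helper name `binom_pmf` is mine, not from the slides) that evaluates B(n, N, θ_0) and confirms it is a probability distribution in n:

```python
from math import comb

def binom_pmf(n, N, theta):
    """Probability of n heads in N independent flips, with P(heads) = theta."""
    return comb(N, n) * theta**n * (1 - theta)**(N - n)

N = 1000
# The pmf sums to one over n = 0, ..., N, as any probability distribution must.
total = sum(binom_pmf(n, N, 0.5) for n in range(N + 1))
print(total)                      # ~1.0
print(binom_pmf(527, N, 0.5))     # probability of exactly 527 heads, fair coin
```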
The likelihood function

We define the likelihood of the data, L(n; θ), as this same function p(n), now regarded as valid for any value of θ:

L(n; \theta) = \binom{N}{n} \theta^n (1 - \theta)^{N-n}

Note that the likelihood depends on both the parameter θ and the observation n.
Likelihood function L(n; θ = 0.5) for this experiment

[Figure: L(n; θ = 0.5) plotted against n for 0 ≤ n ≤ 1000.]

Notice that L(n; θ = 0.5) is a probability density in n (its sum over n is one).
Likelihood function L(n = 527; θ) for this experiment

[Figure: L(n = 527; θ) plotted against θ for 0 ≤ θ ≤ 1.]

Notice that L(n = 527; θ) is not a probability density in θ (its area is not one).
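The contrast between the two views of L can be checked numerically. This sketch (the function names and trapezoid step count are my choices) shows that the sum over n is one, while the area over θ is 1/(N+1), not one:

```python
from math import comb

def L(n, N, theta):
    """Binomial likelihood L(n; theta)."""
    return comb(N, n) * theta**n * (1 - theta)**(N - n)

N, n_obs = 1000, 527

# Sum over n at fixed theta: a probability density in n.
sum_over_n = sum(L(n, N, 0.5) for n in range(N + 1))

# Trapezoid-rule integral over theta at fixed n: not one, but 1/(N+1).
M = 20000
h = 1.0 / M
area_over_theta = sum(h * 0.5 * (L(n_obs, N, i * h) + L(n_obs, N, (i + 1) * h))
                      for i in range(M))
print(sum_over_n)        # ~1.0
print(area_over_theta)   # ~1/1001 ≈ 0.000999
```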
Maximum likelihood estimation

A sensible parameter estimate is the value of θ that maximizes the likelihood of the observation n:

\hat{\theta}(n) = \arg\max_{\theta} L(n; \theta)

For this problem, take the derivative of L with respect to θ, set it to zero, and find

\hat{\theta} = n/N

After observing 527 heads out of 1000 flips, we conclude \hat{\theta} = 0.527. What could be simpler? But the question remains: can we conclude the coin is unfair? How?
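A quick numerical check of the closed-form result (a sketch; the grid resolution is an arbitrary choice of mine) confirms that the likelihood is maximized at θ = n/N = 0.527:

```python
from math import comb

def L(n, N, theta):
    """Binomial likelihood L(n; theta)."""
    return comb(N, n) * theta**n * (1 - theta)**(N - n)

N, n_obs = 1000, 527

# Evaluate the likelihood on a fine grid and locate its maximizer.
grid = [i / 10000 for i in range(10001)]
theta_hat = max(grid, key=lambda t: L(n_obs, N, t))
print(theta_hat)  # 0.527, i.e., n/N
```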
Now the controversy

We want to draw a statistically valid conclusion about whether the coin is fair. This question is the same one that social scientists are asking about whether their data support the existence of ESP. We could pose the question in the form of a yes/no hypothesis: Is the coin fair?
Hypothesis testing

"Significance testing in general has been a greatly overworked procedure, and in many cases where significance statements have been made it would have been better to provide an interval within which the value of the parameter would be expected to lie."

Box, Hunter, and Hunter (1978, p. 109)
Constructing confidence intervals

So let's instead pursue confidence intervals. We have the estimator

\hat{\theta} = n/N

Notice that \hat{\theta} is a random variable. Why? (What is not a random variable in this problem?) We know the probability density of n, so let's compute the probability density of \hat{\theta}:

p_n(n) = \binom{N}{n} \theta_0^n (1 - \theta_0)^{N-n}

p_{\hat{\theta}}(\hat{\theta}) = \binom{N}{N\hat{\theta}} \theta_0^{N\hat{\theta}} (1 - \theta_0)^{N(1 - \hat{\theta})}
Defining the confidence interval

Define a new random variable z = \hat{\theta} - \theta_0. We would like to find a positive scalar a > 0 such that

\Pr(-a \le z \le a) = \alpha

in which 0 < α < 1 is the confidence level. The definition implies that there is α-level probability that the true parameter θ_0 lies in the (symmetric) confidence interval [\hat{\theta} - a, \hat{\theta} + a], or

\Pr(\hat{\theta} - a \le \theta_0 \le \hat{\theta} + a) = \alpha
What's the rub?

The problem with this approach is that the density of z depends on θ_0:

p_{\hat{\theta}}(\hat{\theta}) = \binom{N}{N\hat{\theta}} \theta_0^{N\hat{\theta}} (1 - \theta_0)^{N(1 - \hat{\theta})}

p_z(z) = \binom{N}{N(z + \theta_0)} \theta_0^{N(z + \theta_0)} (1 - \theta_0)^{N(1 - z - \theta_0)}

I cannot find a > 0 such that \Pr(-a \le z \le a) = \alpha unless I know θ_0!
The effect of θ_0 on the confidence interval

[Figure: p_z plotted against z for θ_0 = 0.5 and θ_0 = 0.01; the two densities have very different shapes and widths.]

For θ_0 = 0.5: \Pr(\hat{\theta} - 0.031 \le \theta_0 \le \hat{\theta} + 0.031) = 0.95

For θ_0 = 0.01: \Pr(\hat{\theta} - 0.005 \le \theta_0 \le \hat{\theta} + 0.008) = 0.95
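Both intervals can be verified by summing the binomial pmf directly. The sketch below evaluates the pmf through log-gamma for numerical stability, which is an implementation choice of mine rather than anything from the slides:

```python
from math import lgamma, log, exp

def binom_pmf(n, N, p):
    """Binomial pmf evaluated via log-gamma for numerical stability."""
    logp = (lgamma(N + 1) - lgamma(n + 1) - lgamma(N - n + 1)
            + n * log(p) + (N - n) * log(1 - p))
    return exp(logp)

N = 1000

# theta0 = 0.5: Pr(thetahat - 0.031 <= theta0 <= thetahat + 0.031)
#             = Pr(469 <= n <= 531), since thetahat = n/N.
p_half = sum(binom_pmf(n, N, 0.5) for n in range(469, 532))

# theta0 = 0.01: Pr(thetahat - 0.005 <= theta0 <= thetahat + 0.008)
#              = Pr(0.002 <= n/N <= 0.015) = Pr(2 <= n <= 15).
p_small = sum(binom_pmf(n, N, 0.01) for n in range(2, 16))
print(p_half, p_small)  # both ~0.95
```

Note how the interval for θ_0 = 0.01 is asymmetric, which is exactly the dependence on the unknown θ_0 that the slide is pointing out.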
OK, so now what do we do? Change the experiment

Imagine instead that I draw a random θ from a uniform distribution on the interval [0, 1]. Then I collect a sample of 1000 coin flips with this coin having value θ. What can I conclude from this experiment? Note that both θ and n are now random variables, and they are not independent. There is no true parameter θ_0 in this problem.
Conditional density

Consider two random variables A and B. The conditional density of A given B, denoted p_{A|B}(a|b), is defined as

p_{A|B}(a \mid b) = \frac{p_{A,B}(a, b)}{p_B(b)}, \quad p_B(b) \neq 0
Conditional to Bayes

p_{A|B}(a \mid b) = \frac{p_{A,B}(a, b)}{p_B(b)} = \frac{p_{B,A}(b, a)}{p_B(b)} = \frac{p_{B|A}(b \mid a)\, p_A(a)}{p_B(b)}

p_{A|B}(a \mid b) = \frac{p_{B|A}(b \mid a)\, p_A(a)}{\int p_{B|A}(b \mid a)\, p_A(a)\, da}

According to Papoulis (1984, p. 30), the main idea is due to Thomas Bayes in 1763, but the final form was given by Laplace several years later.
The densities of (θ, n)

The joint density:

p_{\theta,n}(\theta, n) = \binom{N}{n} \theta^n (1 - \theta)^{N-n} for θ ∈ [0, 1] and n ∈ {0, 1, ..., N}, and zero otherwise.

The marginal densities:

p_\theta(\theta) = \sum_{n=0}^{N} p(\theta, n) = 1

p_n(n) = \int_0^1 p(\theta, n)\, d\theta = \frac{1}{N+1}
Bayesian posterior

Computing the conditional density gives

p(\theta \mid n) = (N+1) \binom{N}{n} \theta^n (1 - \theta)^{N-n} = (N+1) B(n, N, \theta) = \beta(\theta; n+1, N-n+1)

The Bayesian posterior is the famous beta distribution. Maximizing the posterior over θ gives the Bayesian estimate

\hat{\theta} = \arg\max_{\theta} p(\theta \mid n) = n/N

which agrees with the maximum likelihood estimate!
Conditional density p(θ | n = 527) for this experiment

[Figure: p(θ | n = 527) plotted against θ for 0 ≤ θ ≤ 1.]

Notice that, unlike L(n = 527; θ), the conditional density p(θ | n = 527) is a probability density in θ (its area is one).
Confidence intervals from the Bayesian posterior

Computing confidence intervals is now unambiguous. Find [a, b] such that

\int_a^b p(\theta \mid n)\, d\theta = \alpha

and there is α-level probability that the random variable θ lies in [a, b] after the observation n.
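One simple way to construct such an interval numerically is to widen a symmetric interval about the posterior mode until it captures the desired mass. This is a sketch only; the expansion loop, step sizes, and function names are my own choices, not the slides' method:

```python
from math import lgamma, log, exp

N, n = 1000, 527
mode = n / N

def log_posterior(theta):
    """log p(theta | n) = log[(N + 1) * B(n, N, theta)]."""
    return (log(N + 1) + lgamma(N + 1) - lgamma(n + 1) - lgamma(N - n + 1)
            + n * log(theta) + (N - n) * log(1 - theta))

def posterior_mass(a, b, steps=4000):
    """Trapezoid-rule integral of p(theta | n) over [a, b]."""
    h = (b - a) / steps
    vals = [exp(log_posterior(a + i * h)) for i in range(steps + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

# Widen a symmetric interval around the mode until it holds 95% of the mass.
half = 0.0
while posterior_mass(mode - half, mode + half) < 0.95:
    half += 0.0005
print(mode - half, mode + half)
```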
Closer look at the conditional density

[Figure: p(θ | n = 527) plotted against θ for 0.48 ≤ θ ≤ 0.6; the marked interval carries probability α = 0.913.]
So is the coin fair?

The Bayesian conclusion is that the 90% symmetric confidence interval centered at \hat{\theta} = 0.527 does not contain θ = 1/2. Therefore I conclude the coin is unfair at the 90% confidence level. But the α = 91.3% confidence interval does include θ = 1/2, so I cannot conclude the coin is unfair with greater than 91.3% confidence.
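The 91.3% figure can be reproduced by integrating the posterior over the symmetric interval about \hat{\theta} = 0.527 whose left edge just touches θ = 1/2, namely [0.5, 0.554] (a sketch using simple trapezoid integration; the step count is my choice):

```python
from math import lgamma, log, exp

N, n = 1000, 527

def posterior(theta):
    """p(theta | n) = (N + 1) * B(n, N, theta), evaluated via log-gamma."""
    return exp(log(N + 1) + lgamma(N + 1) - lgamma(n + 1) - lgamma(N - n + 1)
               + n * log(theta) + (N - n) * log(1 - theta))

# Symmetric interval around thetahat = 0.527 whose edge touches 1/2.
a, b, steps = 0.5, 0.554, 10000
h = (b - a) / steps
vals = [posterior(a + i * h) for i in range(steps + 1)]
alpha = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
print(alpha)  # ~0.913
```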
Back to the NY Times

"Is this significant evidence that the coin is weighted? Classical analysis says yes. With a fair coin, the chances of getting 527 or more heads in 1,000 flips is less than 1 in 20, or 5 percent. To put it another way: the experiment finds evidence of a weighted coin with 95 percent confidence."

What? That had better not be classical analysis. For the binomial, it is true that

\Pr(n \ge 527) = 1 - F(526; N = 1000, \theta = 0.5) = 0.0468

But so what? We don't have n ≥ 527; we have n = 527.
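The quoted tail probability is easy to reproduce (a minimal sketch):

```python
from math import comb

N = 1000
# Pr(n >= 527) for a fair coin: the upper tail the Times article quotes.
tail = sum(comb(N, k) * 0.5**N for k in range(527, N + 1))
print(tail)  # ~0.0468
```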
Back to the NY Times

"Yet many statisticians do not buy it.... It is thus more accurate, these experts say, to calculate the probability of getting that one number, 527, if the coin is weighted, and compare it with the probability of getting the same number if the coin is fair. Statisticians can show that this ratio cannot be higher than about 4 to 1..."

Again, so what?

r = \max_{\theta} \frac{B(527, 1000, \theta)}{B(527, 1000, 0.5)} = \frac{B(527, 1000, 0.527)}{B(527, 1000, 0.5)} = 4.30
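The 4.30 ratio follows directly; here B is the binomial pmf defined earlier in the slides (a minimal sketch):

```python
from math import comb

def B(n, N, theta):
    """Binomial pmf B(n, N, theta)."""
    return comb(N, n) * theta**n * (1 - theta)**(N - n)

# The ratio quoted in the article: likelihood at the MLE vs. at theta = 1/2.
r = B(527, 1000, 0.527) / B(527, 1000, 0.5)
print(r)  # ~4.30
```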
Back to the NY Times

"The point here, said Dr. Rouder, is that 4-to-1 odds just aren't that convincing; it's not strong evidence. And yet classical significance testing has been saying for at least 80 years that this is strong evidence, Dr. Speckman said in an e-mail."

Four-to-one odds means two possible random outcomes have probabilities 0.2 and 0.8. What in this problem has probability 0.2? The quantity

\frac{B(527, 1000, 0.5)}{B(527, 1000, 0.527)} = 0.233

is not the probability of a random event. Where is this statement about 4-to-1 odds coming from?
What did we learn?

- Maximizing the likelihood of the observation gives a sensible estimator.
- It may be problematic to compute confidence intervals when using maximum likelihood.
- Conditional probability quantifies information.
- Bayes' theorem is (almost) a definition of conditional probability.
- If the parameter is also considered random, it is easy to construct the posterior distribution and confidence intervals.
- The controversy here seems to be that some consider θ_0 a fixed parameter, but they cannot then find confidence intervals that are independent of this unknown parameter.
Further reading

Thomas Bayes. An essay towards solving a problem in the doctrine of chances. Phil. Trans. Roy. Soc., 53:370–418, 1763. Reprinted in Biometrika, 35:293–315, 1958.

George E. P. Box, William G. Hunter, and J. Stuart Hunter. Statistics for Experimenters. John Wiley & Sons, New York, 1978.

Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, second edition, 1984.
Questions or comments?
Study question

Consider the classic maximum likelihood problem for the linear model

y = X \theta_0 + e

in which the vector y is measured, the parameter θ_0 is unknown and to be estimated, e is normally distributed measurement error, and X is a constant matrix. When e ∼ N(0, σ²I) and we do not know σ, we obtain the following distribution for the maximum likelihood estimate:

\hat{\theta} \sim N(\theta_0, \sigma^2 (X^T X)^{-1})

This density contains two unknown parameters, θ_0 and σ, but we can still obtain confidence intervals in this case. What's the difference?