Bayesian Inference
STA 121: Regression Analysis
Artin Armagan
Bayes Rule...s!
Reverend Thomas Bayes
p(θ|y) = p(y|θ) p(θ) / p(y)
Posterior ∝ Likelihood (Sampling Distribution) × Prior. Normalizing constant: p(y) = ∫ p(y|θ) p(θ) dθ.
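Not on the original slide, but a minimal numeric sketch of the rule may help; here θ is restricted to two candidate values (the values and the Binomial likelihood are assumptions anticipating the coin example below):

```python
# Minimal sketch of Bayes' rule on a discrete parameter space
# (illustrative values, not from the slides): theta is 0.5 or 0.8.
from scipy.stats import binom

thetas = [0.5, 0.8]          # two candidate values of theta
prior = [0.5, 0.5]           # p(theta): equal prior weight on each
y, n = 80, 100               # observed data

# Likelihood p(y | theta) for each candidate theta
like = [binom.pmf(y, n, t) for t in thetas]

# Normalizing constant p(y) = sum over theta of p(y | theta) p(theta)
p_y = sum(l * p for l, p in zip(like, prior))

# Posterior p(theta | y) = p(y | theta) p(theta) / p(y)
posterior = [l * p / p_y for l, p in zip(like, prior)]
print(dict(zip(thetas, posterior)))   # nearly all mass on theta = 0.8
```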
Flipping the Coin
One of your friends - who's not necessarily the most reliable one - approaches you with a coin for a betting game. He says he wins every time Heads is observed after a flip. After a hundred flips, you lose all your money. At the end of these hundred flips (n=100), you have observed eighty Heads (y=80). Something doesn't look right... You remember that statistics class you took last semester! Who thought you'd ever use that stuff?
Bad Coin?
You remember from your statistics class that the number of successes, Y, in an experiment with n Bernoulli trials is Binomially distributed with probability of success θ:
P(Y=y|θ,n) = C(n,y) θ^y (1-θ)^(n-y).
It looks like an inverse problem: you know how many successes (Heads) you observed but not the underlying proportion (θ) that generated these successes. The data you observed definitely suggest a plausible value for θ. You remember that the maximum likelihood estimator of θ in this case is y/n = 80/100 = 0.8, which is far from fair. Is this sufficient information to conclude that this is not a fair coin, though?
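A quick sanity check of the Binomial pmf and the MLE in Python (a sketch; scipy is assumed available):

```python
from scipy.stats import binom

y, n = 80, 100
theta_mle = y / n                      # maximum likelihood estimator y/n
print(theta_mle)                       # 0.8

# P(Y = y | theta, n) evaluated at a fair coin vs. at the MLE
print(binom.pmf(y, n, 0.5))            # tiny: 80 heads is very unlikely if fair
print(binom.pmf(y, n, theta_mle))      # the likelihood is maximized at y/n
```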
Likelihood × Prior
You remember that you can make probabilistic statements about θ using some Bayesian machinery from your statistics class. You have your sampling distribution (Binomial) and need a prior distribution over θ. Since you don't have much of an idea about what θ may be a priori, you allow it to be any value in [0,1]. To assert your ignorance about θ, you use a uniform distribution over [0,1] as the prior distribution for θ. You remember that the product of the likelihood and the prior is proportional to the posterior distribution of the parameter of interest:
p(θ|y) ∝ θ^y (1-θ)^(n-y) × Constant ∝ θ^y (1-θ)^(n-y).
Is it a Beta?
You remember that if you can normalize this posterior distribution so that it integrates to one, you're all set:
p(θ|y) = θ^y (1-θ)^(n-y) / p(y), where p(y) = ∫ θ^y (1-θ)^(n-y) dθ.
Somehow the expression θ^y (1-θ)^(n-y) looks familiar. You suspect you've seen a density function that looks like this, so you go back and check some of your many statistics books. Now you remember! It looks like a Beta distribution: if a random variable θ is Beta distributed, its pdf is Γ(α+β)/[Γ(α)Γ(β)] × θ^(α-1) (1-θ)^(β-1). Matching exponents, the posterior here is Beta with α = y+1 = 81 and β = n-y+1 = 21.
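A minimal numerical check that the normalized kernel really is this Beta density (the grid resolution is an arbitrary choice):

```python
import numpy as np
from scipy.stats import beta

y, n = 80, 100
theta = np.linspace(0.0, 1.0, 10001)        # grid over [0, 1]
dtheta = theta[1] - theta[0]
kernel = theta**y * (1.0 - theta)**(n - y)  # likelihood x uniform prior

# Normalize with a simple Riemann sum approximating p(y)
post = kernel / (kernel.sum() * dtheta)

# Compare against the Beta(y + 1, n - y + 1) = Beta(81, 21) density
print(np.max(np.abs(post - beta.pdf(theta, y + 1, n - y + 1))))  # close to 0
```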
Posterior
The dashed line represents the prior distribution, while the solid line is the posterior distribution of θ.
Credible Sets
99% Credible Set = [0.6817036, 0.8851065]
99.99% Credible Set = [0.6167157, 0.9192412]
P(θ > 0.5) ≈ 1. Victory!!!
[Figure: posterior density of θ with the 99% and 99.99% credible sets marked.]
Conclusion
We can obtain actual probabilistic statements about θ using credible sets, unlike the confidence intervals of frequentist inference. Here we observed that there is a 99.99% chance that θ lies in the interval [0.6167157, 0.9192412]. Note that this interval excludes the value θ = 0.5. In fact, we can go ahead and calculate the probability that θ > 0.5: P(θ > 0.5) ≈ 1. Thus we have very strong evidence that this coin is not a fair one.
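Assuming the equal-tailed construction behind the slide's numbers, the intervals and the tail probability can be reproduced from the Beta(81, 21) posterior:

```python
from scipy.stats import beta

post = beta(81, 21)                      # posterior: Beta(y+1, n-y+1)

print(post.ppf([0.005, 0.995]))          # 99% credible set, roughly [0.682, 0.885]
print(post.ppf([0.00005, 0.99995]))      # 99.99% credible set, roughly [0.617, 0.919]
print(post.sf(0.5))                      # P(theta > 0.5), essentially 1
```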
A Different Scenario
This time, one of your really reliable friends comes to you for a little betting game with a coin. After a hundred coin tosses, you again lose all your money. However, since this is a good, reliable friend, you want to incorporate your personal judgement that he wouldn't cheat with a rigged coin. You need to find a way to assert some type of prior belief about the situation. Due to your trust in your friend, you think that the underlying θ should be centered around 0.5. How concentrated it is around 0.5 is really up to how much you trust him.
Potential Priors
[Figure: Beta prior densities over θ ∈ [0, 1] for α=β=1, α=β=2, α=β=5, α=β=10, and α=β=50; the larger α=β is, the more the prior concentrates around 0.5.]
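A small sketch quantifying how the symmetric Beta priors above tighten around 0.5 (the α values mirror the panels):

```python
from scipy.stats import beta

# Symmetric Beta(a, a) priors: all centered at 0.5, increasingly concentrated
for a in [1, 2, 5, 10, 50]:
    print(a, beta.mean(a, a), beta.std(a, a))
```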
Likelihood × Prior
p(θ|y) ∝ θ^y (1-θ)^(n-y) × θ^(α-1) (1-θ)^(β-1) = θ^(y+α-1) (1-θ)^(n-y+β-1).
Thus θ is again Beta distributed, with parameters α* = y+α and β* = n-y+β. Depending on our trust level (equivalently, how much weight we want to assign to the prior belief), the posterior is going to change.
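The conjugate update itself is one line of arithmetic; a minimal sketch (the helper name is mine, not from the slides):

```python
def beta_binomial_update(alpha, beta_, y, n):
    """Posterior parameters after observing y heads in n flips."""
    return alpha + y, beta_ + n - y   # (alpha*, beta*) = (y + alpha, n - y + beta)

print(beta_binomial_update(2, 2, 80, 100))    # (82, 22)
print(beta_binomial_update(50, 50, 80, 100))  # (130, 70)
```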
Posterior
Here we used a very special prior distribution: the choice of prior led to a posterior of the same type, i.e. both the prior and the posterior were Beta distributions. Such priors are called conjugate priors. The stronger our belief that the coin is fair, the more the posterior shifts towards 0.5. Eventually, if we assign enough weight to the prior, it will overwhelm the observed data and center the posterior at 0.5.
[Figure: priors (dashed) and the corresponding posteriors (solid).]
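To see the pull towards 0.5 numerically, compare posterior means under increasingly strong symmetric priors (the α grid is illustrative):

```python
y, n = 80, 100
for a in [1, 2, 5, 10, 50, 500]:
    post_mean = (y + a) / (n + 2 * a)   # mean of Beta(y + a, n - y + a)
    print(a, round(post_mean, 3))
# 0.794 at a = 1, drifting towards 0.5 as the symmetric prior gets stronger
```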
Inference on the Population Mean
This may be one of the first cases covered in an introductory statistics course. You have n observations coming from a normal population with an unknown mean μ and a known variance σ²:
Y_i ~ N(μ, σ²), i = 1, ..., n.
You have learnt that the maximum likelihood estimator is the arithmetic mean of the observations, μ_ML = (y_1 + ... + y_n)/n.
A Conjugate Prior
We briefly mentioned conjugate priors. If possible, we'd like a prior that gives rise to a posterior of its own kind. A natural choice that comes to mind is the normal distribution (over μ). We know that our observations come from a normal distribution. We also know that the product of two or more normal densities is proportional to another normal density.
Likelihood × Prior
p(μ|y_1, ..., y_n) ∝ p(y_1, ..., y_n|μ) × p(μ) = N(y_1|μ, σ²) × ... × N(y_n|μ, σ²) × N(μ|μ_0, τ_0²) ∝ N(μ|μ_n, τ_n²),
where μ_n = (μ_0/τ_0² + n μ_ML/σ²) / (1/τ_0² + n/σ²) and 1/τ_n² = 1/τ_0² + n/σ².
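These formulas translate directly into code; a minimal sketch (function and variable names are mine, and the example numbers are made up):

```python
def normal_posterior(mu0, tau0_sq, ybar, sigma_sq, n):
    """Posterior N(mu_n, tau_n^2) for a normal mean with known variance."""
    prec = 1.0 / tau0_sq + n / sigma_sq             # 1 / tau_n^2
    mu_n = (mu0 / tau0_sq + n * ybar / sigma_sq) / prec
    return mu_n, 1.0 / prec

# Example: prior N(0, 1), sample mean 2.5 from n = 20 obs with sigma^2 = 4
print(normal_posterior(0.0, 1.0, 2.5, 4.0, 20))     # approx (2.083, 0.167)
```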
What If No Prior Info?
The posterior mean is a weighted average of the maximum likelihood estimator and the prior mean. How much weight we assign to the prior mean depends on how much we trust our prior belief. Note that as the variance of a normal distribution increases, it becomes flatter and flatter. As τ_0² → ∞, the prior specified earlier approaches a uniform distribution over the whole domain of μ. This imposes NO prior information on the knowledge coming from the observed data.
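A quick check of this limit, reusing the same formula (illustrative numbers):

```python
def normal_posterior_mean(mu0, tau0_sq, ybar, sigma_sq, n):
    prec = 1.0 / tau0_sq + n / sigma_sq
    return (mu0 / tau0_sq + n * ybar / sigma_sq) / prec

# As tau_0^2 grows, the posterior mean approaches the MLE (ybar = 2.5)
for tau0_sq in [0.1, 1.0, 100.0, 1e6]:
    print(tau0_sq, normal_posterior_mean(0.0, tau0_sq, 2.5, 4.0, 20))
```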
What If the Data Contradict the Prior?
Also notice that if we keep τ_0² constant and let n → ∞, the posterior again approaches a normal distribution with mean μ_ML and variance σ²/n. This is because the information coming from the observed data overwhelms our prior belief. So although we Bayesians are statisticians with attitudes, our attitudes dwindle with observed evidence. How quickly the prior's influence diminishes depends on how much trust we place in our prior knowledge.
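And the complementary limit, holding the prior fixed while n grows (again with illustrative numbers):

```python
def normal_posterior_mean(mu0, tau0_sq, ybar, sigma_sq, n):
    prec = 1.0 / tau0_sq + n / sigma_sq
    return (mu0 / tau0_sq + n * ybar / sigma_sq) / prec

# Prior N(0, 1) vs. a sample mean of 2.5: posterior mean -> 2.5 as n grows
for n in [1, 10, 100, 10000]:
    print(n, normal_posterior_mean(0.0, 1.0, 2.5, 4.0, n))
```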
Bayesian yet...?