
Bayesian Statistics

1. In the new era of big data, machine learning and artificial intelligence, it is important for students to know the vocabulary of Bayesian statistics, which has been competing with the classical school (the frequentists) throughout the history of statistics.

2. I will provide two examples to contrast the two schools. Hopefully, in the end you will fall in love with the Bayesian approach.

Example I: Estimating Proportion

1. Imagine you are a millionaire planning to buy a lake. You love eating walleye, so you want to buy a lake with a lot of walleyes in it. The parameter of interest is the proportion of walleyes in the fish population of a lake.

2. Suppose you have no prior information about the proportion of walleyes. That means you believe the proportion could be 0, or 10%, or 20%, ..., or 100% with equal probabilities. The walleye density is low if the proportion is 0.1, while the density is high if the proportion is 0.9.

3. Using the jargon of Bayesian statistics, you think the proportion of walleyes is a random variable, and the prior distribution is flat (kind of like a uniform distribution).

4. Here comes the first difference between Frequentist and Bayes: the Bayesian school treats any unknown parameter (here, the proportion of walleyes) as a random variable, while the Frequentist treats a parameter as an unknown constant. As a result, the Bayesian school implies implicitly that we will never know the unknown parameter for sure (since it is a random variable).

5. The prior distribution is subjective. One person's prior distribution can differ from another person's. For the same person, the prior distribution can evolve over time as more information becomes available.

6. For example, suppose a friend tells you that there used to be a lot of walleyes in the lake, and you trust him. Then you may assign higher probabilities to those big proportions. Then the new prior distribution can be P(θ = 0.1) = 0.2, P(θ = 0.9) = 0.8.

7. Here comes the second difference between Frequentist and Bayes: Bayes thinks probability measures the degree of belief, while the Frequentist treats probability as the long-run frequency. A probability of 0.8 indicates that you have strong faith in what your friend tells you. In that regard, probability is also subjective in the Bayesian world.

8. For the Bayesian school, statistical inference amounts to using information (data) to update the belief. More explicitly, the Bayes theorem states that

P(θ | data) = P(θ) P(data | θ) / P(data)    (Bayes theorem) (1)

where

(a) P(θ) is called the prior distribution of the unknown parameter θ: the belief you have about θ before seeing the data;

(b) P(data | θ) is called the likelihood: the probability that you observe the given sample of data conditional on the parameter;

(c) P(data) is the unconditional or marginal probability of observing the given sample;

(d) most importantly, P(θ | data) is the posterior distribution: the updated belief about θ after the information has been digested.

Simply put, the Bayesian method is concerned with moving from P(θ) to P(θ | data), or moving from the prior distribution to the posterior distribution, using Bayes theorem (1). You can find a discussion of Bayes theorem in any statistics book or on the Internet.

9. We can show that P(data) is free of θ, since θ has been integrated out:

P(data) = Σ_θ P(data, θ) = Σ_θ P(θ) P(data | θ)    (2)

where P(data, θ) is the joint distribution. That means, for the purpose of understanding θ, we can ignore the denominator in (1) and write

P(θ | data) ∝ P(θ) P(data | θ)    (3)

where ∝ means "is proportional to". In short, to obtain the updated belief (the posterior), we may only need to figure out the prior and the likelihood.
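To see the mechanics of (3) on the smallest possible example, here is a sketch (my addition, with an assumed data set, not from the original notes) that updates the two-point prior from item 6 after a single cast that lands one walleye, so the likelihood is simply P(one walleye | θ) = θ:

* Sketch: Bayes theorem (3) on the two-point prior from item 6.
* Assumed data (not from the notes): one cast, one walleye caught,
* so the likelihood equals theta itself.
clear
set obs 2
gen theta = cond(_n==1, 0.1, 0.9)
gen prior = cond(_n==1, 0.2, 0.8)
gen post  = prior*theta          // prior times likelihood, as in (3)
qui sum post
qui replace post = post/r(sum)   // divide by P(data), as in (1)-(2)
list theta prior post            // belief in theta = 0.9 rises above 0.97

One cast with a walleye pushes the belief in θ = 0.9 from 0.8 to about 0.973: exactly the prior-to-posterior move described above, in miniature.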

10. A technical note: when the information set is big (as the sample size goes to infinity), the central limit theorem implies that the likelihood typically converges to a normal distribution. So in the limit, the likelihood (a normal distribution) dominates the prior, and in general the posterior is a bell-shaped curve.

11. For the Bayesian school, the most important result is delivered by plotting the posterior distribution. Next I will show you how to do so for Example I.

12. First we need data (a sample). Suppose you spend a whole day catching 10 fishes in that lake, and 3 of them are walleyes. For the Frequentist, the estimate for the population proportion is just the sample proportion 3/10 = 30%. That's almost it! The Frequentist now believes we almost know the proportion of walleyes, and it is 30%. But the Frequentist admits that there can be sampling uncertainty (another Frequentist possibly catches 4 or even 5 walleyes out of 10 fishes). So they compute the standard error se, and report a confidence interval (0.3 - 1.96 se, 0.3 + 1.96 se). They tell you that with 95% probability the true proportion is inside that interval. That is it! Then the Frequentist goes to the party.

13. The Bayesian school is unhappy with just an interval: they want to know the whole (posterior) distribution of θ (the possible values and their corresponding probabilities).

(a) For simplicity, assume a flat prior distribution P(θ = k) = 1/11, k = 0, 0.1, 0.2, ..., 1. In words, you believe in equal probabilities for low and high densities of the walleye population.

(b) The likelihood, or the probability of getting 3 walleyes out of 10 fishes for a given θ, is given by a binomial distribution:

Likelihood = P(3 successes out of 10 trials) = C_10^3 θ^3 (1 - θ)^(10-3)    (4)

where C denotes the number of combinations. Basically, if the probability of success (catching a walleye) is θ, then the probability of m successes out of n trials is C_n^m θ^m (1 - θ)^(n-m). Please google "binomial distribution" to learn more.

(c) According to (1) and (2), next we need to multiply the prior by the likelihood and divide by the sum of that product. The prior distribution for θ and its posterior distribution after we catch 3 walleyes out of 10 fishes are plotted below.

[Figure: prior (red) and posterior (blue) distributions plotted against theta]

where a green line is drawn to highlight the value θ = 0.3 that occurs with the highest probability.

(d) You can think of that number as the Bayesian point estimate of the proportion, θ̂_Bayesian = 0.3, which in this case is identical to the Frequentist point estimate θ̂_Frequentist = 0.3.

(e) The posterior distribution clearly shows that other values are possible. For instance, either θ = 0.2 or θ = 0.4 can be true with substantial probability. Nevertheless, θ ≤ 0.1 or θ ≥ 0.6 are unlikely given this sample.

(f) So the information (catching 3 walleyes out of 10) is used to update our belief about the proportion: we move from the flat prior distribution (the red one) to the bell-shaped posterior distribution (the blue one). The bell shape confirms the dominance of the likelihood. After showing this graph, the Bayesian person finally can join the party!

14. The Stata code is as follows:

clear
set obs 11                       // 11 grid points for theta
sca n = 10                       // number of fishes caught
sca m = 3                        // number of walleyes among them
gen pv = (_n-1)*0.1              // theta grid: 0, 0.1, ..., 1
gen pr = 1/11                    // flat prior
gen lv = binomial(n, m, pv) - binomial(n, m-1, pv)   // likelihood: P(exactly m)
gen po = lv*pr                   // prior times likelihood
qui sum po
qui replace po = po/r(sum)       // normalize so the posterior sums to 1
twoway (connected po pv, ms(th)) (connected pr pv, ms(oh)), ytitle("distribution")

where the Stata function binomial(n, m, pv) reports P(m or fewer successes out of n trials), so the difference of the two binomial() calls gives P(exactly m successes).
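A side note on the likelihood line (my addition; verify availability in your Stata version's documentation): Stata also provides binomialp(), which returns the exact-count probability directly, so the difference of two cumulative calls can be avoided:

* Equivalent likelihood line, assuming binomialp() is available
* in your Stata version:
gen lv2 = binomialp(n, m, pv)    // P(exactly m successes in n trials)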

15. What if we catch 6 walleyes out of 10 fishes? You only need to change "sca m = 6" in my code and get

[Figure: posterior distribution plotted against theta]

Now the most likely proportion is 0.6! If I am that walleye-loving millionaire, I may decide to buy the lake.

16. Of course, to be safe, the millionaire can try to get a bigger sample (catch 20 fishes and count how many are walleyes). He can also use the current bell-shaped posterior distribution (rather than the naive flat one) as the new prior distribution, and try to get the second-round posterior distribution after catching a bigger sample of fishes. The point is, the Bayes method typically is used in an iterative fashion (a sketch of this second-round updating appears after item 17).

17. To summarize, the Bayesian method uses information to keep updating the belief. After more information arrives, we can update the belief again and again (by plotting the posterior distribution again and again). Then an informed decision can be made based on the posterior distribution. Bayesian statistics is on-going statistics.
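Here is that second-round sketch (my addition; the second sample of 4 walleyes out of 10 is an assumed number, not from the original notes):

* Sketch of second-round updating: yesterday's posterior becomes
* today's prior. Assumed second sample: 4 walleyes out of 10.
clear
set obs 11
sca n = 10
gen pv = (_n-1)*0.1                                   // theta grid
* round 1: flat prior, 3 walleyes out of 10
gen po1 = (1/11)*(binomial(n,3,pv) - binomial(n,2,pv))
qui sum po1
qui replace po1 = po1/r(sum)
* round 2: po1 is the new prior, 4 walleyes out of 10
gen po2 = po1*(binomial(n,4,pv) - binomial(n,3,pv))
qui sum po2
qui replace po2 = po2/r(sum)
twoway (connected po2 pv, ms(th)) (connected po1 pv, ms(oh)), ytitle("distribution")

The second-round posterior po2 is tighter than po1 and centered between 0.3 and 0.4, which is the whole point: each new sample sharpens the belief.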

Example II: Mission Impossible for the Frequentist (played not by Tom Cruise)

1. There are some problems with which the Frequentist simply cannot help. Suppose that, instead of the proportion of walleyes, the millionaire wants to know the total number of all fishes (not just walleyes). He will only buy the lake if that number is greater than, say, 50. I don't think you have learned any classical statistical model to estimate the size of a population. So this is a mission almost impossible for a Frequentist. (I say "almost" because the maximum likelihood method can be used by a Frequentist.)

2. Don't forget those Bayesian guys. This is how the Bayesian method works for this tricky problem. Let's first catch 10 fishes, put red paint on them (or tattoo them if you can) and send them back into the lake. After one day, let's catch another 10 fishes, and count how many have red paint.

3. Suppose we have 3 red ones out of the 10 fishes. Intuitively we can do the math:

10 red fishes / population = 3 / 10  =>  population ≈ 33

This calculation assumes the red fishes spread evenly in the lake. What if they do not (a tattooed guy may like to hang out with another tattooed guy)? So there should be uncertainty associated with the estimate 33. In this case, there is no way a Frequentist can give you something like a standard error. Only the Bayes method can be used to account for that inherent uncertainty.

4. Let the unknown parameter be the population size θ = n. The key insight is that the probability of catching red fishes depends on n: P(success) = 10/n. Hence we can still use the binomial distribution to solve this problem:

P(3 successes out of 10 trials) = C_10^3 (10/n)^3 (1 - 10/n)^(10-3)    (5)

5. For the prior, again let's use a flat one, P(n = k) = 0.1, k = 10, 20, ..., 100, as a starting point. The posterior distribution for the population size after catching 3 red fishes out of 10 is plotted below.

[Figure: prior (red) and posterior (blue) distributions plotted against the population size n]

where the green line marks the most likely population size n = 30. The Stata code is

clear
set obs 10                       // grid of 10 candidate population sizes
sca m = 3                        // number of red fishes recaptured
gen nv = _n*10                   // n grid: 10, 20, ..., 100
gen pv = 10/nv                   // P(success) = 10/n
gen pr = 1/10                    // flat prior
gen lv = binomial(10, m, pv) - binomial(10, m-1, pv)   // likelihood: P(exactly m)
gen po = lv*pr                   // prior times likelihood
qui sum po
qui replace po = po/r(sum)       // normalize the posterior
twoway (connected po nv, ms(th)) (connected pr nv, ms(oh)), ytitle("distribution")

6. Exercise: please modify my code to do a finer search over population sizes n = 10, 11, 12, ..., 100.

7. The possible population sizes and their corresponding probabilities can be listed with

. list nv po

[Table: output of "list nv po" omitted]
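Item 1 of this example said the millionaire buys only if the population exceeds 50 fishes. The posterior turns that decision rule into a one-line summary; here is a sketch (my addition), reusing the variables nv and po left in memory by the code above:

* Sketch: posterior probability that the lake passes the
* millionaire's threshold of 50 fishes.
qui sum po if nv > 50
display "P(population > 50 | data) = " r(sum)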

Bayes is lovable

1. First of all, economists adore Bayes. For example, they use a utility function to measure how happy the millionaire is after buying that lake (and eating all those poor fishes). Because the fish population is random, economists compute something called expected utility:

E(u(c)) = Σ_j u(c_j) P(c_j) = √10 (0) + √20 P(n = 20 | data) + ... + √100 P(n = 100 | data)

Here we assume the consumption is fish, and we use the square root function because it satisfies diminishing marginal utility. Because the probabilities are readily available from the posterior distribution, the Bayes result can be incorporated seamlessly into consumer theory (a sketch of this computation appears at the end of these notes).

2. In fact, any theory or problem that involves uncertainty (probability) can use the help of Bayesian statistics. Alan M. Turing used Bayesian methods to crack the German encrypted military code during WWII; the British navy used Bayesian methods to narrow down the sea area to search when hunting German U-boats; Google engineers use Bayesian methods to guess whether a picture is of a dog or a cat; Dr. Li uses Bayesian methods to show off in front of his young kids... If you believe the world is full of uncertainty, so that we can never know the truth (which lies somewhere unknown in the middle), please give Bayes a serious thought.

3. To learn more about Bayesian theory, I recommend the book Doing Bayesian Data Analysis: A Tutorial with R and BUGS by John K. Kruschke. That book gives a good introduction to R as well.
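As promised in item 1 above, here is a sketch of the expected-utility computation (my addition, not from the original notes), to be run right after the Example II code so that nv and po are still in memory:

* Sketch: expected utility with u(c) = sqrt(c), weighting each
* candidate population size by its posterior probability.
* Assumes nv and po from the Example II code are in memory.
gen eu = sqrt(nv)*po             // u(c_j) * P(c_j | data)
qui sum eu
display "E(u(c)) = " r(sum)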
