Bayesian Statistics. State University of New York at Buffalo. From the SelectedWorks of Joseph Lucke. Joseph F. Lucke


State University of New York at Buffalo. From the SelectedWorks of Joseph Lucke. 2009. Bayesian Statistics. Joseph F. Lucke. Available at: https://works.bepress.com/joseph_lucke/6/

Bayesian Statistics. Joseph F. Lucke, Research Institute for Addictions, State University of New York at Buffalo. Original Presentation: July 29, 2009. Latest Revision: October 19, 2011.

What is Bayesian Analysis? Bayesian Statistical Theory (BST) is a radically different paradigm for statistics. BST is not just another class of statistical models, like structural equation models or multilevel models; BST can analyze any statistical model. Even though the numbers may be the same as in classical theory, the interpretation is different. Issues that trouble classical statistics (e.g., multiple comparisons, sample size adequacy, interpretation of p-values) disappear in BST.

Why Consider Bayesian Analyses? Pragmatic reasons: solve more statistical problems; implement more realistic models; less concern with sample-size issues. Philosophical reasons: conceptual difficulties with classical statistics; closer integration of probability theory with statistical methods; a unified approach to statistics.

Interpretations of Probability. Probability, and hence statistics, are not monolithic disciplines. Probability theory began around 1660. Probability has always had ambiguous interpretations. Mathematical: not interpreted, a branch of measure theory. Objective: relative frequencies. Subjective: logic of opinion.

Historical Chronology. Ambiguous: Pascal (1654), Bayes (1763). Subjective: Laplace (1814). Dormancy: 1830–1880. Objective: Venn (1888), von Mises (1964). Mathematical: Kolmogorov (1933). Subjective: Ramsey (1926), de Finetti (1937).

Bayes? (1702–1761)

Laplace (1749–1827)

Ramsey (1903–1930)

de Finetti (1906–1985)

Current Schools of Statistics. Classical (Neyman-Pearson, Wald, Lehmann): significance, power, decision between hypotheses. Likelihood (Fisher, Royall): accept/reject hypothesis, p-value. Bayesian (Jeffreys, DeGroot): prior and posterior distributions, Bayes factor.

Comparison of Bayes and Neyman-Pearson Theories

Feature              Bayes               Neyman-Pearson
Content              Beliefs             Acts
Unifying principle   Coherence           Inductive behavior
Probability          Subjective          Objective
Repeated events      Exchangeability     Independence
Data                 Fixed               Random
Parameters           Random              Fixed, unknown
Inference            Bayes's Theorem     Estimation
Interval             Fixed (credible)    Random (confidence)
Hypothesis testing   Bayes factor        Significance, power

Comments on Probability. "Probability is only the expression of our ignorance of the true causes." (Laplace, 1814) "There are no such things as objective chances... Chances must be defined by degrees of belief." (Ramsey, 1931) "PROBABILITY DOES NOT EXIST!" (de Finetti, 1972)

Subjective Probability. Probability is the logic of judgment or opinion. Your opinion can be represented as a set of subjectively fair bets on an event. Events may be unique; no repetition is required. Coherence principle: avoid sets of bets that entail a guaranteed loss. Ramsey-de Finetti Theorem (1931): coherent judgments must satisfy the probability axioms.

Coherence. Suppose I bet on the Republican choices of presidential candidate in 2012.

Event     Odds against   Bet ($)
Romney    1:1            9
Perry     2:1            6
Cain      8:1            2

Net payoff to me by outcome (columns are the Romney, Perry, and Cain bets):
Romney wins:  -9   +6   +2
Perry wins:   +9  -12   +2
Cain wins:    +9   +6  -16

No matter what the outcome, I lose $1. My judgments are incoherent. The corresponding probabilities are 1/2 + 1/3 + 1/9 = 17/18 < 1.

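A small Python sketch of the Dutch-book arithmetic above. The $18 payout per ticket is an assumption chosen so that the ticket prices match the slide's $9, $6, and $2 bets; everything else follows the slide's numbers.

```python
# Dutch-book check for the slide's 2012 Republican-nomination bets.
# Odds of k:1 against a candidate correspond to a betting rate of 1/(k + 1).
rates = {"Romney": 1 / 2, "Perry": 1 / 3, "Cain": 1 / 9}
print("sum of my betting rates:", sum(rates.values()))        # 17/18 < 1

# An opponent buys, at my rates, a ticket on each candidate that pays $18
# if that candidate wins; the prices come out to 9, 6, and 2 dollars
# (the "Bets ($)" row on the slide).
payout = 18
price = {c: r * payout for c, r in rates.items()}

# My net gain under each possible winner: I keep every ticket price but must
# pay out $18 on the winning ticket.
for winner in rates:
    row = {c: price[c] - (payout if c == winner else 0) for c in rates}
    print(winner, "wins:", {c: round(v) for c, v in row.items()},
          "total:", round(sum(row.values())))
# Every row totals -1: a guaranteed loss, so the judgments are incoherent.
```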

Exchangeability, Finite case. The concept of exchangeability is the subjectivist's equivalent of random sampling. Given an urn of N variously colored balls (events), the collection is exchangeable if any sample of size n < N has the same distribution as any other sample of size n. The drawing of any single ball has the same information as the drawing of any other single ball. The drawing of any pair of balls has the same information as the drawing of any other pair. The drawing of any n-tuple of balls has the same information as the drawing of any other n-tuple.

Exchangeability, Infinite case. An infinite collection of events is infinitely exchangeable if any finite sample, no matter how large, is exchangeable. De Finetti's Representation Theorem (1937): if a (potentially) infinite collection is infinitely exchangeable, then the collection can be modeled as if it consisted of independent events conditional on some parameter. Infinite exchangeability can be approximated by (finite) exchangeability.

Exchangeability, Examples. If I flip a coin, consider each set of flips to be as informative as any other set of flips, and am willing to consider any number of flips, no matter how large, then the flips can be represented as if they were independent given the coin's tendency to come up heads. Arbuthnot (1710), in 82 years (1629 to 1710) of annual birth records, observed 484,382 male births out of 938,223 total births. If this sample is considered infinitely exchangeable, then the births can be treated as independent with a male birth rate of 51.63%. Infinite exchangeability can be approximated by (finite) exchangeability.

Implications for Statistics. (1) Use all of probability theory. (2) Parameters can be considered uncertain, with a (subjective) probability distribution. (3) Parameter uncertainty is reduced by observations via Bayes's Theorem.

Bayes's Theorem. (1) Let θ be a parameter controlling a model. (2) Let Pr(θ) be the prior probability of θ. (3) Let x be an observation. (4) Let Pr(x | θ) be the data-generating mechanism describing how the observations probabilistically arise from the model controlled by the parameter. (5) Let Pr(x) be the prior predictive distribution. (6) Let Pr(θ | x) be the posterior distribution of the parameter given the data x. (7) Then Bayes's Theorem is Pr(θ | x) = Pr(x | θ) Pr(θ) / Pr(x).

Bayes's Theorem is Trivial. Bayes's Theorem is a trivial theorem in modern probability theory. Proof: Pr(θ | x) Pr(x) = Pr(θ, x) = Pr(x | θ) Pr(θ). Interpretation: the probability of two events happening is the probability of the first event times the probability of the second event given that the first has happened. Whether or not you can use Bayes's Theorem for statistical inference is determined by your interpretation of probability as a logic of opinion or as a relative frequency.

Bayes's Theorem, Example. (1) Assume there are two urns, each with 100 balls. (2) Urn 1 has 90 black balls and 10 white. (3) Urn 2 has 30 black balls and 70 white. (4) So Pr(black ball) = θ = .90 or θ = .30. (5) Let the prior probabilities of θ be Pr(θ = .90) = .1 and Pr(θ = .30) = .9. (6) Assume we draw a black ball. (7) What is the probability that the ball came from Urn 1 (θ = .9) versus Urn 2 (θ = .3)?

Bayes's Theorem, Example continued. (1) Let x = 1 mean the ball is black and x = 0 mean the ball is white. (2) The data-generating mechanism is Bernoulli: Pr(x | θ) = θ^x (1 − θ)^(1−x). (3) The prior predictive probability is Pr(x = 1) = .36. (4) Pr(θ = .9 | x = 1) = Pr(x = 1 | θ = .9) Pr(θ = .9) / Pr(x = 1) = .9 × .1 / .36 = .25. (5) Pr(θ = .3 | x = 1) = Pr(x = 1 | θ = .3) Pr(θ = .3) / Pr(x = 1) = .3 × .9 / .36 = .75.
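
The same calculation written out directly; a minimal Python sketch using only the numbers from the slides:

```python
# Two-urn example: theta is Pr(black ball); x = 1 means a black ball was drawn.
prior = {0.9: 0.1, 0.3: 0.9}                      # Pr(theta)

def likelihood(x, theta):
    # Bernoulli data-generating mechanism: theta^x * (1 - theta)^(1 - x)
    return theta**x * (1 - theta)**(1 - x)

x = 1                                             # we drew a black ball
prior_predictive = sum(likelihood(x, t) * p for t, p in prior.items())    # 0.36
posterior = {t: likelihood(x, t) * p / prior_predictive for t, p in prior.items()}
print(posterior)                                  # {0.9: 0.25, 0.3: 0.75}
```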

Predictive Distributions. Bayesian analysis uses predictive probabilities for observations. A predictive probability of an outcome is its probability weighted by the prior probabilities of the models generating the outcome. The prior predictive probability for x = 1 is Pr(x = 1) = Pr(x = 1 | θ = .9) Pr(θ = .9) + Pr(x = 1 | θ = .3) Pr(θ = .3) = .9 × .1 + .3 × .9 = .36.

Predictive Distributions, continued. More important than the prior predictive probability is the posterior predictive probability of a new, not-yet-observed outcome. The posterior predictive probability of a new outcome is its probability weighted by the posterior probabilities of the models generating the outcome. The posterior predictive probability for a new x̂ = 1 is Pr(x̂ = 1) = Pr(x̂ = 1 | θ = .9) Pr(θ = .9 | x = 1) + Pr(x̂ = 1 | θ = .3) Pr(θ = .3 | x = 1) = .9 × .25 + .3 × .75 = .45.
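
Continuing the same sketch, both predictive probabilities are just likelihoods weighted by the corresponding model weights (prior or posterior):

```python
# Prior and posterior predictive probabilities of drawing another black ball.
prior = {0.9: 0.1, 0.3: 0.9}
posterior = {0.9: 0.25, 0.3: 0.75}                # posterior after one black ball

def bernoulli(x, theta):
    return theta**x * (1 - theta)**(1 - x)

prior_pred = sum(bernoulli(1, t) * w for t, w in prior.items())        # 0.36
post_pred = sum(bernoulli(1, t) * w for t, w in posterior.items())     # 0.45
print(prior_pred, post_pred)
```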

Sources of Priors: 1 of 4. One of the biggest problems in using Bayesian statistics is the requirement of a prior probability for the parameter(s). Personal: what you believe in your heart of hearts; often requires a tedious process of elicitation; usually difficult to incorporate in a statistical model. Expert: determined by a knowledgeable expert; my priors are the expert's priors.

Sources of Priors: 2 of 4. Scientific consensus: reflects the beliefs of the scientific community; tempered to allow some credibility for alternative hypotheses. Adversarial priors: choose a prior favoring the adversary's position; multiple priors; used in sensitivity analyses.

Sources of Priors: 3 of 4. Previous data: previous results can be summarized as a prior for the current study; prior studies are often heterogeneous and only partially relevant; prior studies do not provide summaries as posterior distributions; use discounted priors based on the posteriors of previous results. Theory: determined by scientific theory; possibly only partially determined.

Sources of Priors: 4 of 4. Conjugate: technically simple but flexible; do not require substantial computation; the posterior distribution is in the same class as the prior. Reference: minimal influence on the data; try to express ignorance regarding the parameters; usually conjugate priors.

Swamping of Priors. Dogmatic prior: little or no uncertainty; all prior probability is concentrated on a very small interval or a single point. Open-minded prior: a moderate or large amount of uncertainty; probability is not concentrated. Sufficient data will swamp an open-minded prior. Bayesian Central Limit Theorem: in most cases, with sufficient data, the posterior distribution of a parameter will be approximately normal.

Posterior Analyses The posterior distribution contains all the information regarding the impact of the observed data on the model parameters. Analyses and summaries of the posterior are used to convey the impact of the data on the model parameters.

Posterior Analyses, Location. The first summary usually refers to location: posterior mean, posterior median, or posterior mode.

Posterior Analyses, Spread. Extremely important is a measure of uncertainty regarding the parameter value, usually supplied by the 95% (or other) credible interval. The credible interval is fixed and the parameter is random; in contrast, a confidence interval is random and the parameter is fixed. A (random) parameter has probability .95 of falling within a (fixed) 95% credible interval. A (random) 95% confidence interval has probability .95 of covering the (fixed) unknown parameter.

Bayesian Analysis: Binary Outcome. Arbuthnot (1710) investigated the birth rates of males and females in London. He was particularly concerned with whether the male birth rate exceeded .5. Here we will conduct a Bayesian analysis. Let θ denote the male birth rate; note that 0 ≤ θ ≤ 1. Let the prior for θ be beta(1, 1). This means all birth rates are equally likely a priori; not realistic, but it replicates the classical analysis.

Binary Outcomes, cont'd

Stage     Males   Females   E(θ)   95% BCI        Pr(θ > .50)
prior     (1)     (1)       .500   (.025, .975)   .50
1 day     15      14        .517   (.339, .694)   .57
2 days    28      26        .519   (.386, .649)   .61
1 week    101     91        .517   (.455, .596)   .77
1 month   402     361       .527   (.491, .562)   .93
1 year    5219    4684      .527   (.517, .537)   1.00-
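
A short scipy sketch that should reproduce the table above up to rounding (the counts are the cumulative male and female births from the slide); setting a0, b0 = 50, 50 gives the informative-prior table shown later.

```python
# Beta-binomial analysis of Arbuthnot's data with a beta(1, 1) prior.
from scipy.stats import beta

a0, b0 = 1, 1                                    # prior: beta(1, 1)
stages = {                                       # cumulative (males, females)
    "1 day":   (15, 14),
    "2 days":  (28, 26),
    "1 week":  (101, 91),
    "1 month": (402, 361),
    "1 year":  (5219, 4684),
}
for stage, (males, females) in stages.items():
    post = beta(a0 + males, b0 + females)        # conjugate beta posterior
    lo, hi = post.ppf([0.025, 0.975])            # equal-tailed 95% credible interval
    print(f"{stage:8s} E = {post.mean():.3f}  "
          f"95% BCI = ({lo:.3f}, {hi:.3f})  Pr(theta > .5) = {post.sf(0.5):.2f}")
```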

[Figure: densities for the proportion of male births at the Prior, 1 Day, 1 Week, and 1 Month stages; x-axis: Proportion Male Births.]

Bayesian Analysis: Comments. Sequential sampling: because the observations were considered exchangeable, Arbuthnot could update his priors from day to day, using the previous day's posterior as the subsequent day's prior. Stopping rule: he could stop sampling whenever he wanted. Multiple inferences: the inferences are based solely on the observations obtained; no correction is needed for having made multiple inferences.

Binary Outcomes, Informative Prior. Now consider an informative prior, beta(50, 50).

Stage     Males   Females   E(θ)   95% BCI        Pr(θ > .50)
prior     (50)    (50)      .500   (.403, .600)   .50
1 day     15      14        .504   (.417, .590)   .54
2 days    28      26        .507   (.427, .586)   .56
1 week    101     91        .517   (.460, .575)   .72
1 month   402     361       .524   (.490, .557)   .92
1 year    5219    4684      .527   (.517, .537)   1.00-

[Figure: densities for the proportion of male births under the beta(50, 50) prior at the Prior, 1 Day, 1 Week, and 1 Month stages; x-axis: Proportion Male Births.]

Classical hypothesis testing: PTCA vs Stent. RCT of percutaneous transluminal coronary angioplasty (PTCA) versus provisional stenting (Stent) for increasing survival (Savage, 1997). Expected survival propensity for PTCA is .70. Want the Stent condition to increase survival to at least .75: δ = .05, α = .05, 1 − β = .80. Required sample size: 986 per group.

Data

Group   Sample   Survived   Proportion
PTCA    107      83         .78
Stent   108      90         .83

Classical test of proportions: χ²(1) = 0.80, p = .19, one-sided. The actual sample size is only 11% of the required size. What can we conclude? No difference? Unable to detect any difference? Maybe the study should not have been conducted in the first place.
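
As a rough check on the reported classical result, the following sketch runs a chi-square test of proportions; the Yates continuity correction and the halving of the two-sided p-value to approximate the one-sided test are my assumptions.

```python
# Chi-square test of proportions for PTCA vs Stent survival.
from scipy.stats import chi2_contingency

table = [[83, 107 - 83],      # PTCA: survived, died
         [90, 108 - 90]]      # Stent: survived, died
chi2, p_two_sided, dof, _ = chi2_contingency(table, correction=True)
print(f"chi2({dof}) = {chi2:.2f}, one-sided p = {p_two_sided / 2:.2f}")
# should print roughly chi2(1) = 0.80, one-sided p = 0.19
```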

Bayesian hypothesis testing: PTCA vs Stent. Same RCT of percutaneous transluminal coronary angioplasty (PTCA) versus provisional stenting (Stent) for increasing survival (Savage, 1997). Encode my information prior to observing the data: no difference between groups; expected survival propensity for either group is .70; 95% BCI = (.40, .93) for each group.

Prior. [Figure: prior density of the survival propensity, identical for both groups; mean = .70, 95% BCI = (.40, .93); x-axis: Survival Propensity.]

Data

Group   Sample   Survived   Proportion
PTCA    107      83         .78
Stent   108      90         .83

Pr(Surv | Stent) = .82, 95% BCI = (.75, .89). Pr(Surv | PTCA) = .77, 95% BCI = (.69, .84).
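
A sketch of the posterior survival propensities. The slides do not state the prior's beta parameters; Beta(7, 3) is my assumption, since it has mean .70, roughly matches the stated 95% BCI of (.40, .93), and reproduces the posterior summaries.

```python
# Posterior survival propensities for each arm under an assumed Beta(7, 3) prior.
from scipy.stats import beta

a0, b0 = 7, 3                                               # assumed prior parameters
arms = {"PTCA": (83, 107 - 83), "Stent": (90, 108 - 90)}    # (survived, died)
for arm, (survived, died) in arms.items():
    post = beta(a0 + survived, b0 + died)                   # conjugate beta posterior
    lo, hi = post.ppf([0.025, 0.975])
    print(f"{arm}: mean = {post.mean():.2f}, 95% BCI = ({lo:.2f}, {hi:.2f})")
# roughly Stent .82 (.75, .89) and PTCA .77 (.69, .84), as on the slide
```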

Posterior. [Figure: posterior densities of the survival propensity; Stent = .82, BCI = (.75, .89); PTCA = .77, BCI = (.69, .84); x-axis: Survival Propensity.]

Bayesian Hypothesis Testing: Difference between PTCA and Stent. We are interested in the difference between the PTCA and Stent effects. Mathematical fact: the difference between two beta distributions is approximately normal. If more precision is required, one can always simulate the distribution.
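
A simulation sketch for the difference δ = Stent − PTCA, again under the assumed Beta(7, 3) prior; it should land close to the posterior summaries and hypothesis probabilities reported on the following slides.

```python
# Simulate the posterior of delta = Stent - PTCA survival propensity.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
stent = beta(7 + 90, 3 + 18).rvs(200_000, random_state=rng)   # Stent posterior
ptca = beta(7 + 83, 3 + 24).rvs(200_000, random_state=rng)    # PTCA posterior
delta = stent - ptca

print("posterior mean:", round(delta.mean(), 2))
print("95% BCI:", np.round(np.percentile(delta, [2.5, 97.5]), 2))
print("Pr(delta > 0):", round((delta > 0).mean(), 2))              # superiority
print("Pr(delta > .05):", round((delta > 0.05).mean(), 2))
print("Pr(-.05 <= delta <= .05):", round((np.abs(delta) <= 0.05).mean(), 2))
print("Pr(delta < -.05):", round((delta < -0.05).mean(), 2))       # inferiority
```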

Prior for Difference between PTCA & Stent. [Figure: prior density for the survival difference; prior mean = .00, prior 95% BCI = (−.38, .38); x-axis: Survival Difference.]

Posterior for Difference between PTCA & Stent. [Figure: prior and posterior densities for the survival difference; posterior mean = .05, posterior 95% BCI = (−.05, .15); prior mean = .00, prior 95% BCI = (−.38, .38); x-axis: Survival Difference.]

Bayesian hypothesis testing 1. Let δ denote the difference between the Stent and PTCA propensities of survival. Consider: Stent superiority: δ > 0. Stent inferiority: δ ≤ 0.

Bayesian Hypothesis Testing 1. [Figure: regions of the survival-difference axis (−0.4 to 0.4) labeled Inferior, Equivocal, and Superior.]

Posterior for Difference between PTCA & Stent. [Figure: posterior density for the survival difference; Pr(Inferior) = .15, Pr(Superior) = .85; x-axis: Survival Difference.]

Bayesian hypothesis testing 2. Let δ denote the difference between the Stent and PTCA propensities of survival. Consider: Stent superiority: .05 < δ. Stent-PTCA equivalence: −.05 ≤ δ ≤ .05. Stent inferiority: δ < −.05. Stent non-inferiority: δ ≥ −.05. Stent non-superiority: δ ≤ .05. Equivocal results: −.10 ≤ δ ≤ .10.

Bayesian Hypothesis Testing 2. [Figure: regions of the survival-difference axis (−0.4 to 0.4) labeled Not Superior, Inferior, Equivocal, Equal, Superior, and Not Inferior.]

Bayesian Hypothesis Testing 2. [Figure: posterior density for the survival difference; Pr(Equivalent) = .45, Pr(Inferior) = .02, Pr(Superior) = .53; x-axis: Survival Difference.]

Modern Bayesian Analysis. Before 1980, Bayesian analyses were severely limited to a few classes of models. The computational complexity of posterior distributions precluded any serious analyses. Bayesian textbooks would start off with the theory of Bayesian analysis but end up with the practice of classical analysis. After 1980, Bayesian analyses are no longer so limited. A class of procedures called Markov chain Monte Carlo (MCMC) methods has liberated Bayesian analysis from this complexity. MCMC is implemented in the (free) software WinBUGS and OpenBUGS.

How does MCMC work? MCMC is a simulation technique. It takes each parameter of the model in turn, fixes the data (which are always the same) and the remaining parameters, and generates a draw from the sampling distribution of that parameter. It then moves to the next parameter and generates a draw for that parameter, and so on through all the parameters. Then it starts over with the first updated parameter and repeats the process. After a few thousand trials, the distributions of all the parameters converge. From these distributions, summaries of the parameters can be obtained.
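
A toy Gibbs sampler (one common MCMC scheme) that follows exactly this cycle-through-the-parameters recipe. The normal model with unknown mean and precision, the vague priors, and the simulated data are illustrative assumptions, not something from the slides.

```python
# Minimal Gibbs sampler for normal data with unknown mean mu and precision tau.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=50)       # simulated data
n, ybar = len(y), y.mean()
a0, b0 = 0.01, 0.01                               # vague Gamma(a0, b0) prior on tau
mu, tau = 0.0, 1.0                                # arbitrary starting values

draws = []
for _ in range(5000):
    # 1. Update mu given tau and the data (flat prior on mu).
    mu = rng.normal(ybar, 1.0 / np.sqrt(n * tau))
    # 2. Update tau given mu and the data.
    ss = np.sum((y - mu) ** 2)
    tau = rng.gamma(a0 + n / 2, 1.0 / (b0 + ss / 2))   # numpy gamma takes (shape, scale)
    draws.append((mu, tau))

mus, taus = np.array(draws[1000:]).T              # discard burn-in, keep the rest
print("posterior mean of mu:", round(mus.mean(), 2))
print("posterior mean of sigma:", round((1 / np.sqrt(taus)).mean(), 2))
```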

Thanks to Michael West (2004). http://www.isds.duke.edu/~mw/abs04/lecture_slides/4.stats_regression.pdf