Bayesian philosophy, Bayesian computation, Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017


Bayesian philosophy

Bayesian philosophy Bayesian statistics versus classical statistics: war or co-existence? Classical statistics: Models have variables and parameters; these are conceptually different. Variables represent (potential) data. Parameters are assumed to have a FIXED but UNKNOWN value. Thus, models for unrepeatable events are meaningless. Procedures to compute the parameters from data are judged by their properties when applied to potential new data. Models with estimated parameters inserted yield predictions. Bayesian statistics: Models have only variables; their distribution represents (some person's) KNOWLEDGE about some part of the world. Models for unrepeatable events are meaningful. Models give predictions of (relative) probabilities for data, even before data is observed. Predictions for new observations are made from models conditional on old data.

Example A sequence of independent and equivalent trials is performed, each resulting in success (1) or failure (0). The following data is observed: 0, 1, 0, 0, 1, 0, 0, 1. Classical analysis: A possible model is a Binomial distribution, with probability of success p and x out of 8 trials observed as successes. A possible estimator for p is p̂ = x/8. One can show this estimator is unbiased, i.e., E[p̂] = p. With our data, we get p̂ = 3/8. Plugging this into the model, we compute the probability 0.062 that 4 of the next 5 trials will be successes. Another possible model is a negative Binomial distribution, where y is the number of trials needed to observe 3 successes. A possible estimator for p is p̂ = 3/y. This estimator for p has a different distribution. For example, it is biased, i.e., E[p̂] ≠ p. One might instead use the minimum variance unbiased estimator for p, and get p̂ = (3 − 1)/(8 − 1) = 2/7. But this would yield the probability 0.024 that 4 of the next 5 trials will be successes.
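The two plug-in predictions above are easy to reproduce numerically; a minimal sketch (the function name pred_prob is just illustrative):

```python
from math import comb

# Plug-in probability that k of the next n trials are successes,
# under a Binomial model with success probability p
def pred_prob(p, k=4, n=5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_binom = 3 / 8                # estimator x/8 from the Binomial model
p_negbin = (3 - 1) / (8 - 1)   # minimum variance unbiased estimator, 2/7
print(round(pred_prob(p_binom), 3))   # 0.062
print(round(pred_prob(p_negbin), 3))  # 0.024
```

The two estimators give noticeably different predictions from the very same data.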

Example, continued Assume we want to do a hypothesis test where H0: p ≥ 0.6, while H1: p < 0.6. What will the p-value be? The answer depends on which test statistic we use. Recall that the p-value is the probability, assuming H0 and generating new data, of observing something equally or more extreme than the given test statistic, in terms of rejecting H0. One possibility is the test statistic x, the number of successes in 8 trials. The probability of observing 0, 1, 2, or 3 successes when p = 0.6 is 0.174, so the p-value is 0.174. Another possibility is the test statistic y, the number of trials needed to observe 3 successes. The probability of needing 8 or more trials when p = 0.6 is 0.096, so the p-value is 0.096.
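Both p-values can be checked directly from Binomial probabilities (needing 8 or more trials for the 3rd success is the same event as at most 2 successes in the first 7 trials; the exact second value is 0.0963):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

p0 = 0.6
# Test statistic x: P(X <= 3) for X ~ Binomial(8, 0.6)
pval_x = sum(binom_pmf(k, 8, p0) for k in range(4))
# Test statistic y: P(Y >= 8) = P(at most 2 successes in first 7 trials)
pval_y = sum(binom_pmf(k, 7, p0) for k in range(3))
print(round(pval_x, 3), round(pval_y, 3))  # 0.174 0.096
```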

Example, continued In the classical analysis, answers depend on choice of estimator or test statistic. However, they do not depend on the context. Consider the following contexts: 8 tosses of a coin gives 3 heads. 8 tests of a new medical procedure leads to 3 fatalities. Controlling 8 items produced in a factory uncovers 3 faulty items. In real life, the predicted probability of 4 successes in the next 5 trials would be different in these three contexts. In a Bayesian analysis, the different contexts would be taken into account by formulating a prior probability distribution for p, indicating the prior knowledge: For the coin example, we might use p Beta(20, 20). For the medical example, studies of similar medical procedures might yield p Beta(2, 6). For the factory example, knowledge gained in similar testing might be formulated with p Beta(1, 10).

Digression: The Beta distribution θ has a Beta distribution on [0, 1], with parameters α and β, if its density has the form

π(θ | α, β) = (1 / B(α, β)) θ^(α−1) (1 − θ)^(β−1)

where B(α, β) is the Beta function defined by

B(α, β) = Γ(α)Γ(β) / Γ(α + β)

where Γ(t) is the Gamma function defined by

Γ(t) = ∫₀^∞ x^(t−1) e^(−x) dx

Recall that for positive integers, Γ(n) = (n − 1)! = 1 · 2 · · · (n − 1). See for example Wikipedia for more properties of the Beta distribution, and the Beta and Gamma functions. We write π(θ | α, β) = Beta(θ; α, β) for the Beta density.
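These definitions are easy to check numerically with Python's standard library; a small sketch (the helper names are illustrative):

```python
from math import gamma, factorial

def beta_fn(a, b):             # B(a, b) = Γ(a)Γ(b) / Γ(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

def beta_pdf(theta, a, b):     # density of Beta(a, b) on [0, 1]
    return theta**(a - 1) * (1 - theta)**(b - 1) / beta_fn(a, b)

print(gamma(5) == factorial(4))       # True: Γ(n) = (n-1)! for integers
print(round(beta_pdf(0.5, 2, 2), 2))  # 1.5: the Beta(2, 2) density at 1/2
```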

Example, continued The Bayesian model consists of the appropriate prior for p, and conditionally on each such p, a model for the data: It could be either a Binomial model for x or a negative Binomial model for y. Thus, the Bayesian model is a bivariate probability distribution; there is no conceptual difference between the variable for observed data (be it x or y) and p. Before the data is observed, the probability of observing just this data can be computed using the marginal distribution of x (or y) computed from the bivariate distribution representing the model.

Example, continued The knowledge about p after considering the data can be computed as the conditional distribution when we fix the data. This is called the posterior distribution for p. Note that there is no need to make a subjective choice of an estimator for p. Crucially, the posterior will be the same whether we use a Binomial model or a negative Binomial model for the data. Let z be the number of successes in 5 new trials. Given the posterior distribution for p, we get a bivariate model for p and z by multiplying with a Binomial distribution for z, with 5 trials and probability of success p. The distribution of z can be computed as the marginal distribution over this model. Note that predictions about z will not depend on whether we used a Binomial or negative Binomial distribution for the data.

Bayesian computation, simplest example Bayesian computations can in fact always be performed by multiplying probability densities or functions, and taking conditional or marginal distributions. Example: An archeological item could be from any of three areas, A, B, or C. Based on visual inspection, it is judged to be from A, B, or C with probabilities 0.2, 0.5, and 0.3, respectively. Now a chemical analysis is done to detect two trace elements, X and Y. We know that the probabilities of detecting combinations of these trace elements, given the item's origin, are given in the table below:

     Both X and Y   X only   Y only   None
A    0.1            0.7      0.1      0.1
B    0.6            0.1      0.2      0.1
C    0.1            0.1      0.1      0.7

Bayesian computation, simplest example, continued How can we answer questions like: What is the probability that the item is from A, given that X only is detected? The table below represents the joint distribution, with the marginals in the last column and bottom row:

     Both X and Y   X only   Y only   None
A    0.02           0.14     0.02     0.02   0.2
B    0.30           0.05     0.10     0.05   0.5
C    0.03           0.03     0.03     0.21   0.3
     0.35           0.22     0.15     0.28

All questions can be answered by computing conditional or marginal distributions from the table above. For example, Pr(A | X only) = 0.14/0.22 = 0.636.
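The same table computation in code; a sketch (the dictionary layout is just one way to organize it):

```python
prior = {"A": 0.2, "B": 0.5, "C": 0.3}
lik = {  # P(detection pattern | origin): rows of the first table
    "A": {"both": 0.1, "X only": 0.7, "Y only": 0.1, "none": 0.1},
    "B": {"both": 0.6, "X only": 0.1, "Y only": 0.2, "none": 0.1},
    "C": {"both": 0.1, "X only": 0.1, "Y only": 0.1, "none": 0.7},
}
# Joint distribution: multiply each row by the prior probability of its origin
joint = {o: {d: prior[o] * q for d, q in row.items()} for o, row in lik.items()}
marginal = sum(joint[o]["X only"] for o in joint)   # 0.22
posterior_A = joint["A"]["X only"] / marginal       # 0.14 / 0.22
print(round(posterior_A, 3))  # 0.636
```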

Computations in Beta-Binomial example Let's choose one of the priors: Assume p ~ Beta(2, 6). The probability density becomes π(p) = (1 / B(2, 6)) p^(2−1) (1 − p)^(6−1). If we use that x has a Binomial distribution with 8 trials and parameter p, we get the probability function π(x | p) = C(8, x) p^x (1 − p)^(8−x). The joint model becomes

π(x, p) = π(x | p) π(p) = C(8, x) p^x (1 − p)^(8−x) · (1 / B(2, 6)) p^1 (1 − p)^5.

We would like to compute π(p | x) = π(x, p) / π(x) with x fixed to the value 3. Note that, as a function of p, this must be proportional to p^4 (1 − p)^10. Note also that the Beta distribution with parameters 5 and 11 has a density proportional to p^4 (1 − p)^10. Thus these two densities must be identical! We get

π(p | x = 3) = Beta(p; 5, 11)
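The conjugate update can be verified numerically: prior times likelihood is proportional to the Beta(5, 11) density, with the same constant of proportionality at every p. A sketch:

```python
from math import comb, gamma

def beta_pdf(p, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * p**(a - 1) * (1 - p)**(b - 1)

a0, b0, x, n = 2, 6, 3, 8        # Beta(2, 6) prior, 3 successes in 8 trials
a1, b1 = a0 + x, b0 + (n - x)    # conjugate update -> Beta(5, 11)

def ratio(p):
    # (prior * likelihood) / posterior density: does not depend on p
    unnorm = beta_pdf(p, a0, b0) * comb(n, x) * p**x * (1 - p)**(n - x)
    return unnorm / beta_pdf(p, a1, b1)

print((a1, b1))                                      # (5, 11)
print(abs(ratio(0.2) - ratio(0.7)) < 1e-9)           # True: constant ratio
```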

Computations in previous example, continued More generally, if we had used the prior Beta(p; α, β) for p, we would get the posterior Beta(p; α + 3, β + 5). Note that, if we had chosen to use data y with a negative Binomial distribution, we would have π(y | p) = C(y − 1, 2) p^3 (1 − p)^(y−3), and one can check that the posterior for p would become the same. The possible new data z has a Binomial distribution with 5 trials and parameter p. Multiplying this probability function with the posterior density found above, we get the joint distribution for z and p given the data. We can now compute the marginal

π(z) = ∫ π(z | p) π(p | x = 3) dp = C(5, z) B(5 + z, 16 − z) / B(5, 11).

Thus we get that the probability of 4 successes in 5 new trials is π(z = 4) = 0.04966.
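The predictive distribution is easy to evaluate from the formula above; a sketch (helper names are illustrative):

```python
from math import comb, gamma

def beta_fn(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

# Posterior predictive for z successes in 5 new trials, with p ~ Beta(5, 11)
def pred(z, m=5, a=5, b=11):
    return comb(m, z) * beta_fn(a + z, b + m - z) / beta_fn(a, b)

probs = [pred(z) for z in range(6)]
print(round(sum(probs), 10))   # 1.0: a genuine probability distribution
print(round(pred(4), 5))       # 0.04966
```

Note that 0.04966 differs from both classical plug-in answers (0.062 and 0.024): the Bayesian prediction averages over the remaining uncertainty about p.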

More advanced More generally, let x be a vector representing the data, and let θ be a vector representing the variables of interest. Assume we can write down the probability (density) function π(x | θ), and the prior π(θ). Then the posterior for the parameter θ is given by Bayes' formula

π(θ | x) = π(x | θ) π(θ) / π(x) = π(x | θ) π(θ) / ∫ π(x | θ) π(θ) dθ ∝_θ π(x | θ) π(θ)

where π(x) is the marginal probability (density) for x. Note the notation ∝_θ: if we only know the posterior π(θ | x) up to a factor not depending on θ, it can be reconstructed by requiring the sum (or integral) to be 1. Thus, in order to do inference, i.e., compute the posterior distribution of θ, we only need to find the distribution for θ whose density is proportional to π(x | θ) π(θ).

Computational methods for the posterior When all variables are finite-valued, there are efficient algorithms for exact computations, even when the distributions π(θ) and π(x | θ) are expressed in terms of a network of dependent variables. The computations in the second example above work out (fairly) easily because we chose as the prior for p a distribution that is conjugate to the Binomial distribution (or negative Binomial) for the data. With enough conjugacies, one can also obtain exact posteriors. In all other cases, one can only compute approximations of the posterior. The group of methods called Markov chain Monte Carlo (McMC) is by far the most general and popular class of approximation methods. There are some other approximate algorithms, for example INLA (Integrated Nested Laplace Approximation), but they can be applied only to more limited classes of models.

Markov chain Monte Carlo The idea is to generate an (approximate) sample from the posterior. Then, inference can be done based on this sample. The sample is produced using a Markov chain. The chain is produced by * starting at some more or less arbitrary value θ_0, * for each step, generating a new proposed value from the old, using some algorithm, and * accepting or rejecting the proposed value based on an acceptance criterion. The acceptance criterion depends on the posterior distribution π(θ | x), but it needs to be known only up to a constant. This fits our situation perfectly. The distribution of the chain converges to the correct distribution, but the convergence may be slow. The chain may also have autocorrelation.
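As a concrete illustration, here is a minimal random-walk Metropolis sketch targeting the unnormalized posterior p^4 (1 − p)^10 from the earlier example (step size, chain length, and burn-in are arbitrary choices). Only ratios of the target appear in the acceptance criterion, so the normalizing constant B(5, 11) is never needed:

```python
import random

def target(p):                    # unnormalized Beta(5, 11) density
    return p**4 * (1 - p)**10 if 0 < p < 1 else 0.0

random.seed(1)
p, chain = 0.5, []
for _ in range(20000):
    prop = p + random.gauss(0, 0.1)                 # symmetric proposal
    if random.random() < target(prop) / target(p):  # acceptance criterion
        p = prop                                    # accept the proposal ...
    chain.append(p)                                 # ... else keep old value

sample = chain[2000:]             # discard burn-in
mean = sum(sample) / len(sample)  # should be near E[p] = 5/16 = 0.3125
```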

Checking convergence The simplest method is to monitor the series of values of a variable: does the pattern seem to stabilize? A slightly more advanced method is to use several parallel Markov chains with independent starting points. If convergence is reached, the range of values spanned by all chains together should be the same as the range of values spanned by each chain; otherwise it is larger. This is measured by a quantity called R, and estimated by R̂. If R̂ goes down towards 1, this indicates convergence. High autocorrelation means that the chain moves very slowly; this also indicates slow convergence.
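A minimal sketch of the R̂ computation, comparing between-chain and within-chain variance in the standard Gelman-Rubin style (function and variable names are illustrative):

```python
import random

def rhat(chains):
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand)**2 for mu in means)  # between chains
    W = sum(sum((x - mu)**2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m            # within chains
    return (((n - 1) / n * W + B / n) / W) ** 0.5

random.seed(0)
c1 = [random.gauss(0, 1) for _ in range(2000)]
c2 = [random.gauss(0, 1) for _ in range(2000)]
r_same = rhat([c1, c2])                   # near 1: chains overlap
r_diff = rhat([c1, [x + 5 for x in c2]])  # far above 1: chains disagree
```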

Improving convergence A popular type of McMC is Gibbs sampling. Each proposal changes only one of the variables in the variable vector, and the proposal is drawn from the conditional distribution of this variable given all the others. Gibbs sampling often works well, and is easy to implement. However, for highly correlated variables, convergence can be too slow. General methods to improve convergence speed exist, but often the most effective approach is to look carefully at the shape of your distribution, and choose a proposal function adapted to it.
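A toy Gibbs sampler for a bivariate standard normal with correlation ρ = 0.9, where both full conditionals are known normals (x | y ~ N(ρy, 1 − ρ²) and symmetrically). This is a sketch only; with correlation this high the chain moves slowly, illustrating exactly the convergence problem mentioned above:

```python
import random

rho = 0.9
sd_cond = (1 - rho**2) ** 0.5     # conditional standard deviation
random.seed(2)
x, y = 0.0, 0.0
xs, ys = [], []
for _ in range(20000):
    x = random.gauss(rho * y, sd_cond)  # draw x | y
    y = random.gauss(rho * x, sd_cond)  # draw y | x
    xs.append(x)
    ys.append(y)

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
vx = sum((a - mx)**2 for a in xs) / n
vy = sum((b - my)**2 for b in ys) / n
corr = cov / (vx * vy) ** 0.5     # sample correlation, close to rho
```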

Using the sample for inference Given a sample from a distribution, all properties of the distribution can in fact be estimated from this sample. For example, given a sample of size 10,000 of a variable, you can estimate a 95% credibility interval (i.e., an interval that covers 95% of the probability density) by finding the 250th and the 9750th values in the ordered sample. In R, use quantile. To estimate the expectation of any function f of the variable θ, simply compute f(θ_1), f(θ_2), ..., f(θ_10000) and take their average.
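A quick sketch of both recipes, using a sample drawn directly from the Beta(5, 11) posterior of the earlier example (in practice the sample would come from an McMC run):

```python
import random

random.seed(3)
sample = sorted(random.betavariate(5, 11) for _ in range(10000))

# 95% credibility interval: the 250th and 9750th ordered values
lower, upper = sample[249], sample[9749]

# Expectation of f(theta) = theta^2: average f over the sample
mean_sq = sum(t**2 for t in sample) / len(sample)
# exact value: Var + mean^2 = 55/4352 + (5/16)^2, about 0.110
```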

Bayesian software Most statisticians use both frequentist and Bayesian methods, so a large proportion of available software, including R packages, uses some Bayesian ideas. When models are Bayesian Networks with finite-valued variables (or only normally distributed variables), algorithms for exact inference are available in programs like Hugin (commercial) or GeNIe (free). There are a few general-purpose programs for models formulated as a Bayesian Network. The most famous and oldest is BUGS, which exists in a number of incarnations (WinBUGS, OpenBUGS). It basically implements Gibbs sampling. It can be accessed from R via a number of different R packages, e.g., R2OpenBUGS, BRugs, etc. Some more modern general-purpose programs exist, most notably JAGS and Stan. They implement improvements to the algorithms of BUGS that in general increase convergence speed.