Bayesian Inference

There are different ways to interpret a probability statement in a real-world setting. Frequentist interpretations of probability apply to situations that can be repeated many times, e.g., tossing a coin, or administering a treatment to a large cohort of patients. In this setting, a probability statement relates to the frequency with which an event occurs in the limit, e.g., how often the coin will land heads up, or what proportion of patients taking the treatment will respond favourably. An alternative approach is to view probability as a rational numerical expression of belief and uncertainty. In this setting, an event that we have a high degree of belief to be true (false) will have a probability close to 1 (0). If we have a higher degree of belief in one event than another, then its corresponding probability will be higher. In this context we can assign probabilities to events that do not occur repeatedly, or that we are unable to observe, e.g., the probability that Shakespeare was the author of the plays attributed to him, or that an important part of a plane's engine will fail while it is being used for a commercial flight. This latter interpretation of probability is associated with Bayesian inference.

Bayes's rule

Consider a statistical model for data y with parameters θ. When we perform inference in a Bayesian context, we consider both y and θ to be random variables. We represent our beliefs about the possible values that these variables can take using the joint distribution

    p(θ, y) = p(y | θ) p(θ).

Here p(y | θ) is the likelihood function, while p(θ) is the prior distribution of the parameters. We can interpret p(θ) as our beliefs about the model parameters before any data are observed. If we observe some data y, then we can condition our beliefs about θ by using Bayes's rule to obtain the posterior distribution:

    p(θ | y) = p(y | θ) p(θ) / p(y).

In words, we update our beliefs about θ, conditional on the value of y, with these beliefs expressed in the form of a probability distribution. In practice, the marginal distribution of the data,

    p(y) = ∫_Θ p(y, θ) dθ,

is often difficult to compute exactly. Instead we will usually focus on the unnormalised posterior,

    p(θ | y) ∝ p(y | θ) p(θ).

Either the unnormalised posterior will have a recognisable shape, or we will resort to computational methods to determine the normalised distribution of the parameter, as we will see later.
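As a small illustration of the last point, the following R sketch (not part of the original notes; the Beta(2, 2) prior and the toy data are arbitrary illustrative choices) normalises an unnormalised posterior numerically on a grid:

    # Grid approximation of a posterior (illustrative sketch, assumed prior and data)
    theta <- seq(0.001, 0.999, length.out = 1000)      # grid over the parameter space
    y <- c(1, 0, 1, 1, 0, 1)                           # toy binary data
    log_lik   <- sapply(theta, function(t) sum(dbinom(y, size = 1, prob = t, log = TRUE)))
    log_prior <- dbeta(theta, 2, 2, log = TRUE)        # assumed Beta(2, 2) prior
    unnorm    <- exp(log_lik + log_prior)              # p(y | theta) p(theta), unnormalised
    post      <- unnorm / sum(unnorm * diff(theta)[1]) # normalise by approximating p(y)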

For many years, the inclusion of the prior distribution p(θ) in the inference process was a source of controversy and considerable debate within the statistics community, the argument being that it made inference subjective and unreliable. From a practical perspective, and broadly speaking, when there is a large sample of data, the amount of information in the likelihood overwhelms that in the prior, so that very extreme prior beliefs would be needed to meaningfully influence the inference procedure. On the other hand, when data are scarce, such as in some reliability applications, expert opinion is often required before meaningful probability statements can be made. In general, the utility of Bayesian methods has now been widely demonstrated, and the use of such methods is accepted by the wider scientific community.

Beta-binomial model

Suppose that we observe n trials of a binary outcome y = y_1, ..., y_n, where each y_i ~ Binomial(1, θ). Then the likelihood for the data takes the form

    p(y | θ) = ∏_{i=1}^n θ^{y_i} (1 − θ)^{1 − y_i} = θ^{Σ_i y_i} (1 − θ)^{n − Σ_i y_i}.

We can specify a Beta(a, b) distribution as a prior:

    p(θ | a, b) = c(a, b) θ^{a−1} (1 − θ)^{b−1}.

Here c(a, b) = Γ(a + b) / (Γ(a) Γ(b)) = [∫_0^1 θ^{a−1} (1 − θ)^{b−1} dθ]^{−1}, which we will take as a known result from calculus, and where Γ(x) denotes the gamma function, which is defined for any number x > 0. We call a and b the hyperparameters for the prior distribution. Typically, these values must be chosen. A simple argument (in fact, something of a cop-out) for choosing a and b is to note that setting a = b = 1 means that p(θ | a, b) = 1 for any value of θ. Hence this is often interpreted as a non-informative prior for θ.

Regardless of the choice of hyperparameters, we can combine the prior and likelihood together so that

    p(θ | y, a, b) ∝ p(y | θ) p(θ | a, b)
                  = θ^{Σ_i y_i} (1 − θ)^{n − Σ_i y_i} c(a, b) θ^{a−1} (1 − θ)^{b−1}
                  ∝ θ^{Σ_i y_i + a − 1} (1 − θ)^{n − Σ_i y_i + b − 1}
                  = θ^{a′ − 1} (1 − θ)^{b′ − 1}.

Here we have ignored any terms that do not involve θ, and then used some algebra to tidy up the expression on the right-hand side of the equation. It remains to identify a normalising constant for the posterior p(θ | y, a, b). You should be able to recognise that the shape of p(θ | y, a, b) is the same as the prior p(θ | a, b), but with parameters a′ and b′. This means that p(θ | y, a, b) follows a beta distribution, with parameters a′ = Σ_i y_i + a and b′ = n − Σ_i y_i + b.
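To make the update concrete, here is a minimal R sketch (not from the original notes; the data and the Beta(1, 1) prior are arbitrary illustrative choices) that computes the posterior parameters a′ and b′ directly:

    # Conjugate beta-binomial update (illustrative sketch, assumed data and prior)
    y <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)    # n = 10 binary observations (assumed)
    a <- 1; b <- 1                          # Beta(1, 1) prior
    a_post <- sum(y) + a                    # a' = sum_i y_i + a
    b_post <- length(y) - sum(y) + b        # b' = n - sum_i y_i + b
    a_post / (a_post + b_post)              # posterior mean
    qbeta(c(0.025, 0.975), a_post, b_post)  # central 95% posterior interval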

Figure 1: Examples of the posterior distribution of a beta distribution with different sample sizes (panels n = 10 and n = 20) and hyperparameters (Beta(1,1) and Beta(3,2) priors); each panel shows the prior and posterior densities over p.

Inspecting the parameters of the posterior distribution, we can interpret the updated values a′ and b′ as a combination of our prior knowledge (in the form of the hyperparameters a and b) and summary statistics of the observed data (Σ_i y_i and n − Σ_i y_i). We can interpret the value of a relative to b as a reflection of our prior belief in the number of successes relative to failures we would expect to observe. The value a + b is a reflection of our certainty in these beliefs; larger values indicate higher certainty. If the sample size n is much larger than a + b, then the hyperparameters will have relatively little effect on the posterior distribution. Some examples of the posterior distribution with different sample sizes (n = 10, 20) and hyperparameters ((a = 1, b = 1), (a = 3, b = 2)) are shown in Figure 1.
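Plots like those in Figure 1 can be produced with a few lines of R; the following sketch is not part of the original notes, and the number of successes s is an assumed value, since the simulated data behind the figure are not given:

    # Prior vs posterior density, in the style of Figure 1 (assumed data)
    a <- 3; b <- 2                                       # Beta(3, 2) prior
    n <- 20; s <- 12                                     # assumed n and number of successes
    curve(dbeta(x, a, b), from = 0, to = 1,
          xlab = "p", ylab = "distribution", lty = 2)    # prior (dashed)
    curve(dbeta(x, s + a, n - s + b), add = TRUE)        # posterior (solid)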

Figure 2: Posterior distribution of θ, the probability of a female birth in the case of placenta previa, with a Beta(1,1) prior; the plot shows the prior and posterior densities over p.

Example: estimating the probability of female birth given placenta previa

The following example is taken from [2]. We consider the sex ratio of births for which the maternal condition placenta previa occurred. This is an unusual condition of pregnancy that prevents a normal delivery from occurring. A study concerning the sex of placenta previa births in Germany recorded that 437 of a total of 980 births were female. How much evidence does this provide for the claim that the proportion of female births in the population of placenta previa births is less than 0.5?

If we adopt a uniform Beta(1,1) prior, then the posterior distribution for θ, the probability of a female birth in the case of placenta previa, is Beta(438, 544). This distribution is visualised in Figure 2. The red line indicates the posterior distribution, and the green line the prior. In this case we have a large number of observations, so the curves are very distinct. A dashed line indicates the value θ = 0.5. Clearly, in this case the majority of the mass of the distribution lies to the left of 0.5. Using R (see the short computation below), we can compute the probability P(θ < 0.5) = 0.99965.
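The R computation referred to above is not reproduced in the notes; it amounts to evaluating the beta cumulative distribution function at 0.5:

    # P(theta < 0.5) under the Beta(438, 544) posterior
    pbeta(0.5, shape1 = 438, shape2 = 544)   # approximately 0.99965, as quoted above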

Conjugacy

We have shown that a beta prior and a binomial likelihood lead to a beta posterior, which makes inference easier (we didn't need to do any integration ourselves, and instead could use a known result). We say that the beta prior is conjugate for the binomial distribution. More generally, we say that a class P of prior distributions for θ is conjugate for a likelihood function p(y | θ) if

    p(θ) ∈ P  implies  p(θ | y) ∈ P.

We will exploit this convenient property many times.

Graphical diagram

We can represent the data generation for a beta-binomial model as follows:

    θ ~ Beta(a, b);
    Y_i | θ ~ Binomial(1, θ), for i = 1, ..., n.

This mirrors the factorisation used to form the posterior, i.e., p(θ | y, a, b) ∝ p(y | θ) p(θ | a, b). We can also represent this model using a graphical diagram; see Figure 3.

Figure 3: Graphical diagrams of the beta-binomial model (panels (a) and (b)). The second diagram uses plate notation to represent the data.

This is a graph, consisting of nodes and edges. The nodes represent the parameters and data of the model. The edges connect the nodes, and represent the dependence between parameters. The direction of the edges indicates the nature of the dependence between the parameters. Figure 3a shows each datapoint y_1, ..., y_n separately. Figure 3b shows the same model more concisely; the data y are collectively represented using a plate diagram. In both figures, the node representing y is shaded, denoting that this quantity is observed. The nodes representing a and b are boxes, denoting that they are hyperparameters and are specified by the analyst. The node representing θ is transparent and circular, which means that it is a quantity to be inferred.
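The generative description above translates directly into a simulation. Here is a brief R sketch (not from the original notes; the values of a, b, and n are arbitrary illustrative choices):

    # Simulating data from the beta-binomial generative model (assumed a, b, n)
    set.seed(1)
    a <- 3; b <- 2; n <- 20
    theta <- rbeta(1, a, b)                     # theta ~ Beta(a, b)
    y <- rbinom(n, size = 1, prob = theta)      # Y_i | theta ~ Binomial(1, theta)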

References

[1] P. D. Hoff, A First Course in Bayesian Statistical Methods, Chapter 3. Springer, 2009.

[2] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis, 2nd edition, Chapter 2. Chapman & Hall/CRC, 2004.