Chapter 7: The posterior - the goal of Bayesian inference

7.1 Googling

Suppose you are chosen, for your knowledge of Bayesian statistics, to work at Google as a search traffic analyst. Based on historical data you have the figures shown in Table 7.1 for the actual word searched and the starting string (the first three letters typed into the search box). It is your job to help make the search engine faster by reducing the search space the machines must look up each time a person types.

             Barack Obama   Baby clothes   Bayes
    Bar      50%            30%            30%
    Bab      30%            60%            30%
    Bay      20%            10%            40%

Table 7.1: The columns give the historic breakdown of search traffic for three topics (Barack Obama, Baby clothes, and Bayes) by the first three letters of the user's search.

Problem 7.1.1. Find the minimum coverage confidence intervals for the actual word searched that are at least at the 70% level.

In both cases (here and in the next question) we are looking for sets of the actual words. Frequentists assume that the data we receive (each set of three letters) is a sample from an infinity of such experiments. As such, they design their intervals so that at least 70% of such intervals contain the true word searched, across all the potential data samples we could receive. This means that here we want to choose sets such that, regardless of the three letters typed, we get a coverage of at least 70% for each of the columns. These are shown in Table 7.2.

Problem 7.1.2. Find the most narrow credible intervals for the word searched that are at least at the 70% level.

Bayesians condition on the data we actually receive, and derive intervals based on this information.

             Barack Obama   Baby clothes   Bayes    Credibility
    Bar      [50%]           30%            30%     45%
    Bab       30%           [60%            30%]    75%
    Bay      [20%            10%            40%]    100%
    Coverage  70%            70%            70%

Table 7.2: 70% confidence intervals. Brackets mark the words included in the interval for each three-letter string.

This means we need to consider the individual row sums, each time making an interval that covers at least 70% of that row. The answer to this question is shown in Table 7.3.

             Barack Obama   Baby clothes   Bayes    Credibility
    Bar      [50%            30%]           30%     73%
    Bab       30%           [60%            30%]    75%
    Bay       20%           [10%            40%]    71%
    Coverage  50%           100%            70%

Table 7.3: 70% credible intervals.

Now suppose that your boss gives you the historic search information shown in Table 7.4. Further, you are told that it is most important to correctly suggest the actual topic as one of the first auto-complete options, irrespective of the topic searched.

Problem 7.1.3. Do you prefer confidence intervals or credible intervals in this circumstance?

Here all we need to do is work out the total losses under the confidence and credible intervals. For both cases this means working out the expected loss for each of the actual words being searched, using the volumes given in Table 7.4. This is easily done using the coverages at the bottom of Tables 7.2 and 7.3. For the confidence intervals we thus get an expected loss:

    loss = 0.6 × (1 − 0.7) + 0.3 × (1 − 0.7) + 0.1 × (1 − 0.7) = 0.3    (7.1)

And for the credible intervals:

    loss = 0.6 × (1 − 0.5) + 0.3 × (1 − 1) + 0.1 × (1 − 0.7) = 0.33    (7.2)

So in this circumstance we prefer the confidence intervals.

                    Barack Obama   Baby clothes   Bayes
    Search volume   60%            30%            10%

Table 7.4: The historic search traffic broken down by topic.
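The coverage and credibility columns of Tables 7.2 and 7.3, and the expected losses in equations (7.1) and (7.2), can be verified numerically. The Python sketch below (not part of the original exercise) simply transcribes Tables 7.1 and 7.4 into arrays and marks the bracketed cells with indicator matrices.

```python
# Minimal numerical check of Tables 7.2-7.3 and equations (7.1)-(7.2).
import numpy as np

# Rows: bar, bab, bay. Columns: Barack Obama, Baby clothes, Bayes.
p_letters_given_topic = np.array([[0.5, 0.3, 0.3],
                                  [0.3, 0.6, 0.3],
                                  [0.2, 0.1, 0.4]])   # Table 7.1
topic_volumes = np.array([0.6, 0.3, 0.1])             # Table 7.4

interval_sets = {
    "confidence (Table 7.2)": np.array([[1, 0, 0],
                                        [0, 1, 1],
                                        [1, 1, 1]]),
    "credible (Table 7.3)":   np.array([[1, 1, 0],
                                        [0, 1, 1],
                                        [0, 1, 1]]),
}

for name, included in interval_sets.items():
    kept = p_letters_given_topic * included
    coverage = kept.sum(axis=0)                                   # per word (columns)
    credibility = kept.sum(axis=1) / p_letters_given_topic.sum(axis=1)  # per string (rows)
    expected_loss = np.sum(topic_volumes * (1 - coverage))        # Problem 7.1.3
    print(name)
    print("  coverage:   ", coverage.round(2))
    print("  credibility:", credibility.round(2))
    print("  expected loss (per word searched):", round(expected_loss, 2))
```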

Problem 7.1.4. Now assume that it is most important to pick the correct actual word across all potential sets of three letters. Which interval do you prefer now?

Now we need to find the loss for each possible three-letter search. This requires that we first calculate the historic search volumes for these letter strings using Tables 7.1 and 7.4. Specifically, you take the matrix product of the two, yielding a percentage of historical searches of (42%, 39%, 19%) for (bar, bab, bay). You then use the credibility levels of each row from the confidence and credible interval tables respectively to weight the losses. For the confidence intervals:

    loss = 0.42 × (1 − 0.45) + 0.39 × (1 − 0.75) + 0.19 × (1 − 1) = 0.33    (7.3)

And for the credible intervals:

    loss = 0.42 × (1 − 0.73) + 0.39 × (1 − 0.75) + 0.19 × (1 − 0.71) = 0.27    (7.4)

So in this case we prefer the credible intervals.

7.2 GDP versus infant mortality

The data in posterior_gdpinfantmortality.csv contain the GDP per capita (in real terms) and the infant mortality rate across a large sample of countries in 1998.

Problem 7.2.1. A simple model is fit to the data, of the form:

    M_i ~ N(α + β GDP_i, σ)    (7.5)

Fit this model to the data using a Frequentist approach. How well does the model fit the data? Graph the data first!

I have perhaps been a bit misleading here in asking the student to fit the model before graphing the data. However, this is sort of the point: you should never blindly fit a model to data. If you graph the data, you see that a linear model is not well suited to them at all; a power law is better suited. You can also use AIC/BIC/adjusted-R² and so on to compare the models, but really the graphical explanation is the gold standard.

Problem 7.2.2. An alternative model is:

    log(M_i) ~ N(α + β log(GDP_i), σ)    (7.6)

Fit this model to the data using a Frequentist approach. Which model do you prefer, and why?

See above - this model is much better suited!
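As a rough illustration of the two Frequentist fits, the sketch below runs scipy's linregress on the raw and log-transformed data. The column names "gdp" and "infant_mortality" are assumptions; check the actual headers of posterior_gdpinfantmortality.csv and rename as necessary.

```python
# Sketch of the Frequentist fits for Problems 7.2.1 and 7.2.2.
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

df = pd.read_csv("posterior_gdpinfantmortality.csv").dropna()
gdp = df["gdp"].to_numpy()                     # assumed column name
mortality = df["infant_mortality"].to_numpy()  # assumed column name

linear_fit = stats.linregress(gdp, mortality)                  # M_i = a + b GDP_i
loglog_fit = stats.linregress(np.log(gdp), np.log(mortality))  # log M_i = a + b log GDP_i
print("linear  R^2:", round(linear_fit.rvalue ** 2, 2))
print("log-log R^2:", round(loglog_fit.rvalue ** 2, 2))

# Graph the data first! The raw scatter is strongly curved; the log-log
# scatter is roughly linear, which is why the power-law model wins.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(gdp, mortality)
ax1.set(xlabel="GDP per capita", ylabel="infant mortality")
ax2.scatter(np.log(gdp), np.log(mortality))
ax2.set(xlabel="log GDP per capita", ylabel="log infant mortality")
plt.tight_layout()
plt.show()
```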

Problem 7.2.3. Construct 80% confidence intervals for (α, β) for the log-log model.

Take the point estimates of the parameters and add (and subtract) the relevant critical value of a standardised Student-t distribution with n − 2 degrees of freedom (here the population standard deviation is unknown, so we need a t rather than a normal), multiplied by each parameter's standard error. For an 80% two-sided interval we need the 10% critical value in each tail; for a t with the relevant degrees of freedom this is about 1.28. We therefore obtain the following 80% confidence intervals:

    6.8 ≤ α ≤ 7.3
    −0.53 ≤ β ≤ −0.46

Problem 7.2.4. We have fit the log-log model to the data using MCMC. Samples from the posterior for (α, β, σ) are contained within the file posterior_posteriorsgdpinfantmortality.csv. Using these data, find the 80% credible intervals for all parameters (assuming these intervals to be symmetric about the median). How do they compare with the confidence intervals calculated above for (α, β)? How does the point estimate of σ from the Frequentist approach above compare?

Using the quantile function these can be estimated as:

    6.8 ≤ α ≤ 7.3
    −0.53 ≤ β ≤ −0.46
    0.56 ≤ σ ≤ 0.64

The first two are, to the accuracy shown, indistinguishable from the Frequentist estimates. The point estimates of σ for the two approaches are:

    σ̂_F = 0.59
    σ̂_B = 0.60

where I have used the posterior mean for the Bayesian estimate.
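A minimal sketch of the quantile calculation for Problem 7.2.4 follows; the column names "alpha", "beta" and "sigma" in the MCMC output file are assumptions.

```python
# Sketch of the 80% credible intervals in Problem 7.2.4, taken symmetrically
# about the median (10th and 90th percentiles). Column names are assumed.
import pandas as pd

posterior = pd.read_csv("posterior_posteriorsgdpinfantmortality.csv")
for name in ["alpha", "beta", "sigma"]:
    lower, upper = posterior[name].quantile([0.10, 0.90])
    print(f"{name}: 80% credible interval [{lower:.2f}, {upper:.2f}], "
          f"posterior mean {posterior[name].mean():.2f}")
```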

Problem 7.2.5. The following priors were used for the three parameters:

    α ~ N(0, 10)
    β ~ N(0, 10)
    σ ~ N(0, 5), where σ ≥ 0

Explain any similarity between the confidence and credible intervals in this case.

Here the priors are very diffuse over the possible range of the parameters. To a (rough) approximation this is equivalent to a flat prior on the parameters. This means that from Bayes' rule we have (approximately):

    p(θ|X) ∝ p(X|θ)    (7.7)

Therefore the confidence and credible intervals are going to be largely similar here.

Problem 7.2.6. How are the estimates of the parameters (α, β, σ) correlated? Why?

α and β are negatively correlated. This is because we want a line that goes through the centre of the data: if the y intercept increases, then the slope must decrease.

Problem 7.2.7. Generate samples from the prior predictive distribution. How do the min and max of the prior predictive distribution compare with the actual data?

The prior predictive distributions show about two orders of magnitude greater variation than the actual data.

Problem 7.2.8. Generate samples from the posterior predictive distribution, and compare these with the actual data. How well does the model fit the data?

There are a number of ways to compare the model with the data here. I have just used the min and max as points of comparison. What we see is that the minimum is captured well by the model, but the max isn't. In particular, the variation seen in the fitted model is greater than that in the data. This is because at low values of GDP there could be a deviation from the log-log model (or it's just due to sampling variation, of course).
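For Problems 7.2.7 and 7.2.8, a sketch of the prior predictive simulation is given below; the priors are those stated in Problem 7.2.5, and the "gdp" column name is again an assumption. The posterior predictive check of Problem 7.2.8 is the same calculation with (α, β, σ) drawn row by row from posterior_posteriorsgdpinfantmortality.csv rather than from the priors.

```python
# Sketch of a prior predictive simulation (Problem 7.2.7). The "gdp" column
# name is assumed; the priors follow Problem 7.2.5, with a half-normal on sigma.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
log_gdp = np.log(pd.read_csv("posterior_gdpinfantmortality.csv")["gdp"]
                 .dropna().to_numpy())

n_sims = 1000
alpha = rng.normal(0, 10, n_sims)
beta = rng.normal(0, 10, n_sims)
sigma = np.abs(rng.normal(0, 5, n_sims))   # half-normal draws for sigma

# Each row of log_m_sim is one simulated data set of log infant mortality.
mu = alpha[:, None] + beta[:, None] * log_gdp[None, :]
log_m_sim = rng.normal(mu, sigma[:, None])
print("prior predictive min/max of log mortality:",
      round(log_m_sim.min(), 1), round(log_m_sim.max(), 1))
```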

7.3 Bayesian neurosurgery

Suppose that you are a neurosurgeon and have been given the unenviable task of finding the position of a tumour within a patient's brain, and cutting it out. Along two dimensions - vertical height and the left-right axis - the tumour's position is known to a high degree of confidence. However, along the remaining (front-back) axis the position is uncertain and cannot be ascertained without surgery. Fortunately, a team of brilliant statisticians has already done most of the job for you and has generated samples from the posterior for the tumour's location along this axis, given in the data file posterior_braindata.csv. Suppose that the more brain that is cut, the more the patient is at risk of losing cognitive functions. Additionally, suppose that there is uncertainty over the amount of damage done to the patient during surgery. As such, three different surgeons have differing views on the damage caused:

1. Surgeon 1: Damage varies quadratically with the distance the surgery starts away from the tumour.
2. Surgeon 2: There is no damage if the tissue cut is within 0.0001 mm of the tumour; for cuts further away there is a fixed damage.
3. Surgeon 3: Damage varies linearly with the absolute distance the surgery starts away from the tumour. (Hard - use the fundamental theorem of calculus for this part of the question.)

Problem 7.3.1. Under each of the three regimes above, find the best position along this axis to cut.

Surgeon 1: A quadratic loss function has the form L(θ̂, θ) = (θ̂ − θ)², where θ̂ is the point estimate of the parameter and θ is its actual value. We can find the expected loss:

    E(L) = ∫ (θ̂ − θ)² p(θ|x) dθ
         = θ̂² ∫ p(θ|x) dθ − 2θ̂ ∫ θ p(θ|x) dθ + ∫ θ² p(θ|x) dθ
         = θ̂² − 2θ̂ ⟨θ|x⟩ + ⟨θ²|x⟩

which is minimised if θ̂ = ⟨θ|x⟩, the posterior mean. From the data the mean is 6.1.

Surgeon 2: A binary loss function has the form L(θ̂, θ) = 1 − δ_{θ=θ̂}, where δ_{θ=θ̂} is a Dirac delta centred at θ = θ̂. Finding the expected loss:

    E(L) = ∫ (1 − δ_{θ=θ̂}) p(θ|x) dθ = 1 − p(θ = θ̂ | x)

which is minimised when θ̂ = arg max_θ p(θ|x); in other words the MAP estimator, which from a histogram of the data is at around 4.5.

Surgeon 3: A linear loss is of the form L(θ̂, θ) = |θ̂ − θ|, resulting in an expected loss of:

    E(L) = ∫ |θ̂ − θ| p(θ|x) dθ
         = ∫_{−∞}^{θ̂} (θ̂ − θ) p(θ|x) dθ + ∫_{θ̂}^{∞} (θ − θ̂) p(θ|x) dθ
         = θ̂ ∫_{−∞}^{θ̂} p(θ|x) dθ − θ̂ ∫_{θ̂}^{∞} p(θ|x) dθ − ∫_{−∞}^{θ̂} θ p(θ|x) dθ + ∫_{θ̂}^{∞} θ p(θ|x) dθ

Differentiating the above with respect to θ̂ (using Leibniz's rule for differentiation under the integral sign), we get:

    dE(L)/dθ̂ = 2θ̂ p(θ̂|x) + ∫_{−∞}^{θ̂} p(θ|x) dθ − ∫_{θ̂}^{∞} p(θ|x) dθ − 2θ̂ p(θ̂|x)
              = ∫_{−∞}^{θ̂} p(θ|x) dθ − ∫_{θ̂}^{∞} p(θ|x) dθ = 0

which is true only when ∫_{−∞}^{θ̂} p(θ|x) dθ = 0.5; in other words θ̂ is the posterior median. For the data the median is 5.19.
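A sketch of how the three point estimates could be computed from the posterior samples follows. The file is assumed to contain the samples in its first column, and the MAP value is read off a histogram, exactly as described above.

```python
# Sketch of the three point estimates in Problem 7.3.1. The posterior samples
# are assumed to sit in the first column of posterior_braindata.csv.
import numpy as np
import pandas as pd

samples = pd.read_csv("posterior_braindata.csv").iloc[:, 0].to_numpy()

posterior_mean = samples.mean()        # minimises quadratic loss (Surgeon 1)
posterior_median = np.median(samples)  # minimises linear loss (Surgeon 3)

# Crude MAP estimate for the binary loss (Surgeon 2): the midpoint of the
# fullest histogram bin; a kernel density estimate would be smoother.
counts, edges = np.histogram(samples, bins=50)
i = np.argmax(counts)
posterior_map = 0.5 * (edges[i] + edges[i + 1])

print(posterior_mean, posterior_map, posterior_median)  # roughly 6.1, 4.5 and 5.19
```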

Problem 7.3.2. Which of the above loss functions do you think is most appropriate, and why?

I would say that either the quadratic or the linear loss function is more appropriate than the binary loss. One could argue that losses to brain function are likely proportional to the volume of tissue lost, which grows like the distance cubed. Since a quadratic loss is closer to this cubic behaviour, we might suppose that it is the most appropriate.

Problem 7.3.3. Which loss function might you choose to be most robust to any situation?

Either the quadratic or the linear loss, since most problems exhibit a loss that increases with the distance away from the true value.

Problem 7.3.4. Following on from the previous point, which type of posterior point measure might be most widely applicable?

Either the posterior mean or the median.

Problem 7.3.5. Using the data, estimate the loss under the three different regimes assuming that the true loss is L(θ̂, θ) = |θ̂ − θ|³.

The mean loss from the mean is 151; from the mode it is 245; from the median it is 175.
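These figures can be reproduced with the sketch below, which uses the same assumed file layout as the previous snippet and averages the cubed absolute distance between each candidate estimate and the posterior samples.

```python
# Sketch of Problem 7.3.5: expected |estimate - theta|^3 loss for each of the
# three point estimates, averaged over the posterior samples.
import numpy as np
import pandas as pd

samples = pd.read_csv("posterior_braindata.csv").iloc[:, 0].to_numpy()
counts, edges = np.histogram(samples, bins=50)
i = np.argmax(counts)
estimates = {"mean": samples.mean(),
             "mode (MAP)": 0.5 * (edges[i] + edges[i + 1]),
             "median": np.median(samples)}

for label, estimate in estimates.items():
    cubic_loss = np.mean(np.abs(estimate - samples) ** 3)
    print(f"{label}: expected cubic loss = {cubic_loss:.0f}")  # about 151, 245, 175
```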

