Introduction to Bayesian Statistics
STA 442/2101 Fall 2018
This slide show is an open-source document. See last slide for copyright information. 1 / 42

Thomas Bayes (1701-1761). Image from Wikipedia. 2 / 42

Bayes Theorem Bayes Theorem is about conditional probability. It has statistical applications. 3 / 42

Bayes Theorem: The most elementary version
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)}$$
4 / 42

There are many versions of Bayes Theorem. For discrete random variables,
$$P(X=x \mid Y=y) = \frac{P(X=x,\, Y=y)}{P(Y=y)} = \frac{P(Y=y \mid X=x)\,P(X=x)}{\sum_t P(Y=y \mid X=t)\,P(X=t)}$$
5 / 42

For continuous random variables,
$$f_{x \mid y}(x \mid y) = \frac{f_{xy}(x,y)}{f_y(y)} = \frac{f_{y \mid x}(y \mid x)\,f_x(x)}{\int f_{y \mid x}(y \mid t)\,f_x(t)\,dt}$$
6 / 42

Compare:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)}$$
$$P(X=x \mid Y=y) = \frac{P(Y=y \mid X=x)\,P(X=x)}{\sum_t P(Y=y \mid X=t)\,P(X=t)}$$
$$f_{x \mid y}(x \mid y) = \frac{f_{y \mid x}(y \mid x)\,f_x(x)}{\int f_{y \mid x}(y \mid t)\,f_x(t)\,dt}$$
7 / 42
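As a quick numeric sanity check, the elementary version of Bayes Theorem can be evaluated directly. The numbers below (a rare event A and a fairly accurate indicator B) are hypothetical, not from the slides, and the sketch is in Python rather than the course's R:

```python
# Hypothetical inputs: P(A) = 0.01, P(B|A) = 0.95, P(B|A^c) = 0.10.
p_a = 0.01
p_b_given_a = 0.95
p_b_given_ac = 0.10

# Elementary Bayes Theorem: P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)P(A^c)]
p_ac = 1 - p_a
posterior = (p_b_given_a * p_a) / (p_b_given_a * p_a + p_b_given_ac * p_ac)

print(round(posterior, 4))  # 0.0876
```

Even with a 95% accurate indicator, the posterior probability of the rare event stays below 9% because the prior P(A) is so small.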

Philosophy: Bayesian versus Frequentist. What is probability? Probability is a formal axiomatic system (thank you, Mr. Kolmogorov). Of what is probability a model? 8 / 42

Of what is probability a model? Two answers Frequentist: Probability is long-run relative frequency. Bayesian: Probability is degree of subjective belief. 9 / 42

Statistical inference How it works Adopt a probability model for data X. Distribution of X depends on a parameter θ. Use observed value X = x to decide about θ. Translate the decision into a statement about the process that generated the data. Bayesians and Frequentists agree so far, mostly. The description above is limited to what frequentists can do. Bayes methods can generate more specific recommendations. 10 / 42

What is a parameter? To the frequentist, it is an unknown constant. To the Bayesian, since we don't know the value of the parameter, it's a random variable. 11 / 42

Unknown parameters are random variables. To the Bayesian, that's because probability is subjective belief. We model our uncertainty with a probability distribution, π(θ). π(θ) is called the prior distribution: prior because it represents the statistician's belief about θ before observing the data. The distribution of θ after seeing the data is called the posterior distribution. The posterior is the conditional distribution of the parameter given the data. 12 / 42

Bayesian Inference. Model is p(x|θ) or f(x|θ). The prior distribution π(θ) is based on the best available information, but yours might be different from mine. It's subjective. Use Bayes Theorem to obtain the posterior distribution π(θ|x). As the notation indicates, π(θ|x) might be the prior for the next experiment. 13 / 42

Subjectivity. Subjectivity is the most frequent objection to Bayesian methods. The prior distribution influences the conclusions: two scientists may arrive at different conclusions from the same data, based on the same statistical analysis. The influence of the prior goes to zero as the sample size increases, for all but the most bone-headed priors. 14 / 42

Bayes Theorem: Continuous case
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid t)\,\pi(t)\,dt} \propto f(x \mid \theta)\,\pi(\theta)$$
15 / 42

Bayes Theorem: General case
$$E(g(\theta) \mid x) = \frac{\int g(\theta)\,f(x \mid \theta)\,d\pi(\theta)}{\int f(x \mid \theta)\,d\pi(\theta)}$$
16 / 42

Once you have the posterior distribution, you can... Give a point estimate of θ. Why not E(θ | X = x)? Test hypotheses, like H0: θ ∈ H. Reject H0 if P(θ ∈ H | X = x) < P(θ ∉ H | X = x). Why not? We should be able to do better than "Why not?" 17 / 42

Decision Theory Any time you make a decision, you can lose something. Risk is defined as expected loss. Goal: Make decisions so as to minimize risk. Or if you are an optimist, you can maximize expected utility. 18 / 42

Decisions: d = d(x) ∈ D. d is a decision. It is based on the data. It is an element of a decision space. 19 / 42

Decision space D. It is the set of possible decisions that might be made based on the data. For estimation, D is the parameter space. For accepting or rejecting a null hypothesis, D consists of 2 points. Other kinds of decision are possible, not covered by frequentist inference. What kind of chicken feed should the farmer buy? 20 / 42

Loss function: L = L(d(X), θ) ≥ 0. When X and θ are random, L is a real-valued random variable. 21 / 42

Risk is Expected Loss. With L = L(d(X), θ),
$$E(L) = E(E[L \mid X]) = \int \left( \int L(d(x), \theta)\, d\pi(\theta \mid x) \right) dP(x)$$
Any decision d(x) that minimizes posterior expected loss for all x also minimizes overall expected loss (risk). Such a decision is called a Bayes decision. This is the theoretical basis for using the posterior distribution. We need an example. 22 / 42

Coffee taste test A fast food chain is considering a change in the blend of coffee beans they use to make their coffee. To determine whether their customers prefer the new blend, the company plans to select a random sample of n = 100 coffee-drinking customers and ask them to taste coffee made with the new blend and with the old blend, in cups marked A and B. Half the time the new blend will be in cup A, and half the time it will be in cup B. Management wants to know if there is a difference in preference for the two blends. 23 / 42

Model: The conditional distribution of X given θ. Letting θ denote the probability that a consumer will choose the new blend, treat the data X_1, ..., X_n as a random sample from a Bernoulli distribution. That is, independently for i = 1, ..., n,
$$p(x_i \mid \theta) = \theta^{x_i}(1-\theta)^{1-x_i}$$
for x_i = 0 or x_i = 1, and zero otherwise. Then
$$p(x \mid \theta) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_{i=1}^n x_i}\,(1-\theta)^{n - \sum_{i=1}^n x_i}$$
24 / 42
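The last equality, that the likelihood depends on the data only through the sum of the x_i, can be checked numerically. A small sketch in Python (the slides use R), taking 60-of-100 successes as a hypothetical data set:

```python
import math

# Hypothetical data: 60 of 100 tasters pick the new blend, evaluated at theta = 0.6.
x = [1] * 60 + [0] * 40
theta = 0.6

# Product form: prod over i of theta^{x_i} (1 - theta)^{1 - x_i}
prod_form = math.prod(theta**xi * (1 - theta)**(1 - xi) for xi in x)

# Reduced form: theta^{sum x_i} (1 - theta)^{n - sum x_i}
s, n = sum(x), len(x)
reduced_form = theta**s * (1 - theta)**(n - s)

print(math.isclose(prod_form, reduced_form, rel_tol=1e-9))  # True
```

Both expressions give the same (tiny) likelihood value; the sufficient statistic is the success count.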

Prior: The Beta distribution
$$\pi(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$$
for 0 < θ < 1, and zero otherwise. Note α > 0 and β > 0. 25 / 42

Beta prior: $\pi(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$. Supported on [0, 1], with
$$E(\theta) = \frac{\alpha}{\alpha+\beta}, \qquad Var(\theta) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$
Can assume a variety of shapes depending on α and β. When α = β = 1, it's uniform. Bayes used a Bernoulli model and a uniform prior in his posthumous paper. 26 / 42

Posterior distribution
$$\pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta) = \theta^{\sum_{i=1}^n x_i}(1-\theta)^{n-\sum_{i=1}^n x_i} \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} \propto \theta^{(\alpha+\sum_{i=1}^n x_i)-1}(1-\theta)^{(\beta+n-\sum_{i=1}^n x_i)-1}$$
Proportional to the density of a Beta(α′, β′), with
$$\alpha' = \alpha + \sum_{i=1}^n x_i, \qquad \beta' = \beta + n - \sum_{i=1}^n x_i$$
So that's it! 27 / 42
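The conjugate update above is just two additions. A minimal sketch in Python (the slides use R); the function name is illustrative:

```python
# Conjugate update for the Beta-Bernoulli model: posterior parameters in closed form.
def beta_posterior(alpha, beta, x):
    """Return (alpha', beta') for a Beta(alpha, beta) prior and Bernoulli data x."""
    s, n = sum(x), len(x)
    return alpha + s, beta + n - s

# Slides' example: uniform prior (alpha = beta = 1), 60 of 100 pick the new blend.
x = [1] * 60 + [0] * 40
print(beta_posterior(1, 1, x))  # (61, 41)
```

No integration is needed: the posterior is recognized by inspection, which is exactly what makes conjugate priors convenient.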

Conjugate Priors. Prior was Beta(α, β). Posterior is Beta(α′, β′). Prior and posterior are in the same family of distributions: the Beta is a conjugate prior for the Bernoulli model. The posterior was obtained by inspection. Conjugate priors are very convenient. There are conjugate priors for many models. There are also important models for which conjugate priors do not exist. 28 / 42

Picture of the posterior. Suppose 60 out of 100 consumers picked the new blend of coffee beans. The posterior is Beta, with α′ = α + Σ x_i = 61 and β′ = β + n − Σ x_i = 41.
[Figure: posterior density π(θ | x), a Beta(61, 41) curve over θ ∈ (0, 1), peaked near 0.6.] 29 / 42

Estimation. Question: How should I estimate θ? The answer to the question is another question: what is your loss function? First, what is the decision space? D = (0, 1), the same as the parameter space. d ∈ D is a guess about the value of θ. The loss function is up to you, but surely the more you are wrong, the more you lose. How about squared error loss? L(d, θ) = k(d − θ)². We can omit the proportionality constant k. 30 / 42

Minimize expected loss: L(d, θ) = (d − θ)². Denote E(θ | X = x) by µ. Then
$$E(L(d,\theta) \mid X=x) = E\left((d-\theta)^2 \mid X=x\right) = E\left((d-\mu+\mu-\theta)^2 \mid X=x\right) = \cdots = E\left((d-\mu)^2 \mid X=x\right) + E\left((\theta-\mu)^2 \mid X=x\right) = (d-\mu)^2 + Var(\theta \mid X=x)$$
This is minimal when d = µ = E(θ | X = x), the posterior mean. This was general: the Bayes estimate under squared error loss is the posterior mean. 31 / 42
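The conclusion, that posterior expected loss equals (d − µ)² plus the posterior variance and so is minimized at the posterior mean, can be checked on a small grid for the Beta(61, 41) posterior from the example. A sketch in Python (grid points chosen arbitrarily):

```python
# Posterior Beta(61, 41): mean and variance in closed form.
a, b = 61, 41
mu = a / (a + b)
var = a * b / ((a + b)**2 * (a + b + 1))

def risk(d):
    # Posterior expected squared-error loss: (d - mu)^2 + Var(theta | x)
    return (d - mu)**2 + var

# The risk at the posterior mean is just the posterior variance,
# and every other guess on the grid does strictly worse.
grid = [mu + delta for delta in (-0.1, -0.01, 0.0, 0.01, 0.1)]
best = min(grid, key=risk)
print(best == mu, round(risk(mu), 6))
```

The minimum risk, the posterior variance, is about 0.0023 here, so a guess 0.05 away from the mean roughly doubles the expected loss.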

Back to the example. Give the Bayes estimate of θ under squared error loss. The posterior distribution of θ is Beta, with α′ = α + Σ x_i = 61 and β′ = β + n − Σ x_i = 41.

> 61/(61+41)
[1] 0.5980392

32 / 42

Hypothesis Testing. θ > 1/2 means consumers tend to prefer the new blend of coffee. Test H0: θ ≤ θ0 versus H1: θ > θ0. What is the loss function? When you are wrong, you lose. Try zero-one loss.

Loss L(d_j, θ):
Decision        When θ ≤ θ0   When θ > θ0
d0: θ ≤ θ0          0              1
d1: θ > θ0          1              0

33 / 42

Compare expected loss for d0 and d1.

Loss L(d_j, θ):
Decision        When θ ≤ θ0   When θ > θ0
d0: θ ≤ θ0          0              1
d1: θ > θ0          1              0

Note L(d0, θ) = I(θ > θ0) and L(d1, θ) = I(θ ≤ θ0), so
$$E(I(\theta > \theta_0) \mid X = x) = P(\theta > \theta_0 \mid X = x), \qquad E(I(\theta \le \theta_0) \mid X = x) = P(\theta \le \theta_0 \mid X = x)$$
Choose the decision with the smaller posterior probability of being wrong. Equivalently, reject H0 if P(H0 | X = x) < 1/2. 34 / 42

Back to the example. Decide between H0: θ ≤ 1/2 and H1: θ > 1/2 under zero-one loss. The posterior distribution of θ is Beta, with α′ = 61 and β′ = 41. Want P(θ > 1/2 | X = x).

> 1 - pbeta(1/2,61,41) # P(theta > theta0 | X = x)
[1] 0.976978

35 / 42

How much worse is a Type I error?

Loss L(d_j, θ):
Decision        When θ ≤ θ0   When θ > θ0
d0: θ ≤ θ0          0              1
d1: θ > θ0          k              0

To conclude H1, its posterior probability must be at least k times as big as the posterior probability of H0. k = 19 is attractive. A realistic loss function for the taste test would be more complicated. 36 / 42
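The k-to-1 rule can be sketched by simulation. The Python below approximates P(θ > 1/2 | x) by drawing from the Beta(61, 41) posterior (a Monte Carlo stand-in for R's pbeta), then applies the threshold with k = 19; the sample size and seed are arbitrary choices:

```python
import random

random.seed(2018)

# Posterior is Beta(61, 41); approximate P(theta > 1/2 | x) by Monte Carlo.
m = 200_000
p_h1 = sum(random.betavariate(61, 41) > 0.5 for _ in range(m)) / m
p_h0 = 1 - p_h1

# Zero-k loss: conclude H1 only if its posterior probability is
# at least k times that of H0, i.e. p_h0 < 1/(k+1) = 0.05 for k = 19.
k = 19
decide_h1 = p_h1 > k * p_h0
print(round(p_h1, 2), decide_h1)
```

With P(θ > 1/2 | x) ≈ 0.977 and P(H0 | x) ≈ 0.023 < 0.05, the data clear even this stricter bar, so the company would conclude that consumers prefer the new blend.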

Computation. Inference will be based on the posterior. Must be able to calculate E(g(θ) | X = x), for example E(L(d, θ) | X = x), or at least ∫ L(d, θ) f(x | θ) π(θ) dθ. If θ is of low dimension, numerical integration usually works. For high dimension, it can be tough. 37 / 42

Monte Carlo Integration to get E(g(θ) | X = x). Based on simulation from the posterior: sample θ_1, ..., θ_m independently from the posterior distribution and calculate
$$\frac{1}{m} \sum_{j=1}^m g(\theta_j) \stackrel{a.s.}{\longrightarrow} E(g(\theta) \mid X = x)$$
by the Law of Large Numbers. A large-sample confidence interval is helpful. 38 / 42
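A minimal sketch of Monte Carlo integration for the running example, in Python rather than the course's R: with g(θ) = θ, the simulation average should approach the exact posterior mean 61/102.

```python
import random

random.seed(42)

# Approximate E(g(theta) | X = x) for the Beta(61, 41) posterior by simulation.
m = 100_000
draws = [random.betavariate(61, 41) for _ in range(m)]

# g(theta) = theta gives the posterior mean; exact value is 61/102 = 0.598...
mc_mean = sum(draws) / m
print(abs(mc_mean - 61 / 102) < 0.005)  # True
```

Here direct simulation works because the posterior is a familiar distribution; the point of the method is that the same average applies whenever posterior draws can be obtained at all.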

Sometimes it's Hard. If the posterior is a familiar distribution (and you know what it is), simulating values from the posterior should be routine. If the posterior is unknown or very unfamiliar, it's a challenge. 39 / 42

The Gibbs Sampler (Geman and Geman, 1984). θ = (θ_1, ..., θ_k) is a random vector with a (posterior) joint distribution. It is relatively easy to sample from the conditional distribution of each component given the others. Algorithm, say for θ = (θ_1, θ_2, θ_3): first choose starting values of θ_2 and θ_3 somehow. Then,
Sample from the conditional distribution of θ_1 given θ_2 and θ_3. Set θ_1 to the resulting number.
Sample from the conditional distribution of θ_2 given θ_1 and θ_3. Set θ_2 to the resulting number.
Sample from the conditional distribution of θ_3 given θ_1 and θ_2. Set θ_3 to the resulting number.
Repeat. 40 / 42
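The algorithm can be sketched on a toy target where the full conditionals are known exactly: a bivariate standard normal with correlation ρ, which is not the slides' coffee example but makes each conditional draw a single call to a normal generator. Python, with hypothetical settings for ρ, the starting values, burn-in, and chain length:

```python
import random

random.seed(0)

# Toy target: bivariate standard normal with correlation rho.
# Each full conditional is normal: theta1 | theta2 ~ N(rho*theta2, 1 - rho^2).
rho = 0.6
sd = (1 - rho**2) ** 0.5

t1, t2 = 0.0, 0.0           # starting values, chosen arbitrarily
burn_in, m = 1_000, 50_000  # burn-in period, then retained draws
draws = []
for i in range(burn_in + m):
    t1 = random.gauss(rho * t2, sd)  # sample theta1 | theta2; set theta1
    t2 = random.gauss(rho * t1, sd)  # sample theta2 | theta1; set theta2
    if i >= burn_in:
        draws.append((t1, t2))

# The retained draws should look like the target: marginal mean 0,
# and E(theta1 * theta2) = rho.
mean1 = sum(a for a, _ in draws) / m
corr = sum(a * b for a, b in draws) / m
print(abs(mean1) < 0.05, abs(corr - rho) < 0.05)
```

Successive draws are dependent (each pair is generated from the previous one), which is exactly why the burn-in and time series diagnostics discussed on the next slide matter in practice.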

Output. The Gibbs sampler produces a sequence of random (θ_1, θ_2, θ_3) vectors. Each one depends on the past only through the most recent one: it's a Markov process. Under technical conditions (ergodicity), it has a stationary distribution that is the desired (posterior) distribution. Stationarity is a concept; in practice, a burn-in period is used. The random vectors are sequentially dependent, so time series diagnostics may be helpful. Retain one parameter vector every n iterations, and discard the rest. 41 / 42

Copyright Information. This slide show was prepared by Jerry Brunner, Department of Statistical Sciences, University of Toronto. It is licensed under a Creative Commons Attribution - ShareAlike 3.0 Unported License. Use any part of it as you like and share the result freely. The LaTeX source code is available from the course website: http://www.utstat.toronto.edu/~brunner/oldclass/appliedf18 42 / 42