Some Asymptotic Bayesian Inference (background to Chapter 2 of Tanner's book)

Principal Topics

The approach to certainty with increasing evidence.

The approach to consensus for several agents, with increasing shared evidence.

A role for statistical models in these asymptotic results:
o symmetry/independence assumptions in these results.
o data reduction.
o asymptotic Normal inference for these results.

Generalizing the coin-tossing example from last lecture:

Sample space of (observable) outcomes: a 2-sided coin is tossed repeatedly, indefinitely,

X = <X_1, X_2, ..., X_n, ...>,

where X_j = 0 or X_j = 1 as the coin lands tails up or heads up on the j-th flip. So x = <x_1, x_2, ..., x_n, ...> is a point of the space Ω = {0,1}^ℵ0.

Of course, at any one time we observe only a finite, initial segment.
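
A minimal simulation sketch of this setup (the seed, the "true" parameter theta_true, and the segment length n are illustrative choices, not from the notes): we generate flips and observe only a finite initial segment H_n.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.6          # hypothetical chance of heads (assumption for illustration)
n = 20                    # length of the observed initial segment

H_n = rng.binomial(1, theta_true, size=n)   # X_j = 1 for heads, 0 for tails
k = H_n.sum()                               # number of heads among the first n flips
print(H_n, "k =", k, "n - k =", n - k)
```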

The events that make up the σ-algebra & are the (smallest) σ-field of sets including all the (historical) observable events, of the form

H_n = <x_1, x_2, ..., x_n, {0,1}, {0,1}, ...>.

The Statistical Model: Introduce a statistical quantity, a parameter Θ, such that the events in & have a determinate conditional probability, given the parameter.

Bernoulli (i.i.d.) coin flipping (continued):

P(X_j = 1 | θ) = θ  (j = 1, ...), for 0 < θ < 1,

P(H_n | θ) = θ^k (1-θ)^(n-k),

where k of the first n coordinates of H_n are 1 and n-k of the first n coordinates of H_n are 0.
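
A sketch of this likelihood in code (the function name and the example values are ours, not the notes'); note that it depends on the history H_n only through the counts (k, n-k).

```python
def bernoulli_likelihood(theta, k, n):
    """P(H_n | theta) for a history with k heads among the first n flips."""
    return theta**k * (1.0 - theta)**(n - k)

# The likelihood evaluated at two parameter values for a history with k = 7, n = 10.
print(bernoulli_likelihood(0.5, k=7, n=10))
print(bernoulli_likelihood(0.7, k=7, n=10))
```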

Now, if we are willing to make Θ into a random variable (by expanding the σ-algebra accordingly), we can write Bayes' theorem for the parameter:

P(θ | H_n) = P(H_n | θ) P(θ) / P(H_n) ∝ P(H_n | θ) P(θ),

OR: the posterior probability for Θ is proportional to the product of the likelihood for Θ and its prior probability.
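
A numerical sketch of "posterior ∝ likelihood × prior" by brute-force normalization on a grid, assuming a flat Beta(1, 1) prior and hypothetical data (7 heads in 10 flips); the grid size is an arbitrary choice.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta)                     # flat Beta(1, 1) prior (assumption)
k, n = 7, 10                                    # hypothetical data
likelihood = theta**k * (1 - theta)**(n - k)    # P(H_n | theta)

unnormalized = likelihood * prior
dtheta = theta[1] - theta[0]
posterior = unnormalized / (unnormalized.sum() * dtheta)   # divide out P(H_n)

print((theta * posterior).sum() * dtheta)       # posterior mean, close to (1+7)/(2+10)
```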

With the conjugate Beta(α, β) prior for Θ, P(θ | H_n) is given by the distribution Beta(α+k, β+n-k), having

mean  (α+k) / (α+k+β+n-k) = (α+k) / (α+β+n)

and variance  (α+k)(β+n-k) / [(α+β+n)^2 (α+β+n+1)].

Note: Here we may reduce the historical data of n bits to the two quantities (k, n-k). That is, the two likelihood functions P(H_n | θ) and P(k, n-k | θ) are the same function of θ up to a constant factor:

P(k, n-k | θ) ∝ P(H_n | θ) = θ^k (1-θ)^(n-k).
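
A short sketch of the conjugate update and the posterior mean and variance formulas above (the prior parameters and data are illustrative), checked against scipy's Beta distribution.

```python
from scipy import stats

alpha, beta = 2.0, 2.0          # Beta(alpha, beta) prior (assumption)
k, n = 7, 10                    # hypothetical data: 7 heads in 10 flips

a_post, b_post = alpha + k, beta + (n - k)          # posterior is Beta(alpha+k, beta+n-k)
post_mean = a_post / (a_post + b_post)              # (alpha+k) / (alpha+beta+n)
post_var = (a_post * b_post) / ((a_post + b_post)**2 * (a_post + b_post + 1))

print(post_mean, post_var)
print(stats.beta(a_post, b_post).mean(), stats.beta(a_post, b_post).var())  # same numbers
```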

Now, by the Strong Law of Large Numbers: for each ε > 0 and given θ,

P( lim_n |k/n - θ| < ε | θ ) = 1.

Hence, with probability 1, the sequence of posterior probabilities for θ,

lim_n P(θ | H_n) = lim_n Beta(α+nθ, β+n(1-θ)),

has a limit distribution with mean θ and variance 0, independent of α and β.

Note: The posterior variances for θ are O(1/n). That is, in advance, we can bound from below the precision (that is, bound from above the variance) of the posterior distribution for the parameter by choosing the sample size to observe.
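
A simulation sketch of the O(1/n) behavior (prior, true parameter, and seed are illustrative choices): as n grows, the posterior mean approaches the data frequency and n times the posterior variance stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, theta_true = 2.0, 2.0, 0.6          # assumptions for illustration

flips = rng.binomial(1, theta_true, size=10_000)
for n in (10, 100, 1_000, 10_000):
    k = flips[:n].sum()
    a, b = alpha + k, beta + (n - k)
    mean = a / (a + b)
    var = a * b / ((a + b)**2 * (a + b + 1))
    print(f"n={n:6d}  posterior mean={mean:.3f}  posterior var={var:.2e}  n*var={n*var:.3f}")
```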

Asymptotic Certainty

THUS, under each conjugate (Beta) prior: with probability 1, the posterior probability for θ converges to the (0-1) Delta distribution, concentrated on the true parameter value. From the perspective of the posterior probability for θ, through the likelihood function, the data swamp the prior.

Asymptotic Consensus (Merging of Posterior Probability Distributions)

As a metric (distance) between two distributions P and Q over the algebra &, consider a strict standard, the uniform distance,

U(P, Q) = sup_{E ∈ &} |P(E) - Q(E)|.

Let P_n = P(θ | H_n) and Q_n = Q(θ | H_n) (n = 1, 2, ...) be two sequences of posterior probability distributions for the parameter θ, based on two (conjugate) Beta priors. Then it is not hard to show that

lim_n U(P_n, Q_n) = lim_n sup_Θ |P(θ | H_n) - Q(θ | H_n)| = 0.

In other words, the two systems of posterior probabilities for the parameter, based on shared evidence, merge together.
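
A numerical sketch of merging (the two priors, the true parameter, the seed, and the grid are our choices): as a computable stand-in for the uniform distance, we compare the two posterior CDFs on a grid and watch the gap shrink with n.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
theta_true = 0.6
flips = rng.binomial(1, theta_true, size=10_000)
grid = np.linspace(0.0, 1.0, 2001)

for n in (10, 100, 1_000, 10_000):
    k = flips[:n].sum()
    P_n = stats.beta(1 + k, 1 + n - k)      # posterior under a Beta(1, 1) prior
    Q_n = stats.beta(8 + k, 2 + n - k)      # posterior under a Beta(8, 2) prior
    gap = np.max(np.abs(P_n.cdf(grid) - Q_n.cdf(grid)))
    print(f"n={n:6d}  sup |P_n - Q_n| over the grid = {gap:.4f}")
```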

Question: What about posterior probability distributions over the algebra generated by the observable events, &?

Recall that the i.i.d. Bernoulli statistical model for the data is shared between these two investigators:

(∀E ∈ &)  P(E | θ) = Q(E | θ).

Also, with conjugate priors from the Beta family, the prior probability is positive for each historical event H_n. That is,

(∀H_n, 0 < θ < 1)  P(H_n | θ) > 0.

Moreover, P(H_n) = ∫_Θ P(H_n | θ) dP(θ). Therefore P(H_n) > 0, and likewise Q(H_n) > 0.

Answer:

P(E | H_n) = ∫_Θ P(E | θ, H_n) dP_n(θ)
           = ∫_Θ [P(E, H_n | θ) / P(H_n | θ)] dP_n(θ),

and as P_n merges with Q_n for large n,

∫_Θ [P(E, H_n | θ) / P(H_n | θ)] dQ_n(θ)
   = ∫_Θ [Q(E, H_n | θ) / Q(H_n | θ)] dQ_n(θ)
   = ∫_Θ Q(E | θ, H_n) dQ_n(θ)
   = Q(E | H_n).

Thus, the two posterior predictive distributions (over &) also merge. For example, the probability that the next flip lands heads given H_n is

P(X_{n+1} = 1 | H_n) = E_{P_n}[θ] = (α+k) / (α+β+n),

which for large n is approximately k/n, and, by parallel reasoning, so is Q(X_{n+1} = 1 | H_n).

Note that the agreement between P(E | H_n) and Q(E | H_n) takes a stronger form for cases when the historical observation H_n precludes E, i.e., when E ∩ H_n = ∅. Then P(E | H_n') = Q(E | H_n') = 0 for all n' ≥ n.
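
A sketch of the predictive formula above for two different priors (the parameter values and the function name are ours): both predictive probabilities approach k/n, and each other, as n grows.

```python
def predictive_heads(alpha, beta, k, n):
    """P(X_{n+1} = 1 | H_n) under a Beta(alpha, beta) prior: (alpha+k)/(alpha+beta+n)."""
    return (alpha + k) / (alpha + beta + n)

for k, n in [(7, 10), (70, 100), (700, 1_000)]:
    p = predictive_heads(alpha=2.0, beta=2.0, k=k, n=n)
    q = predictive_heads(alpha=8.0, beta=2.0, k=k, n=n)
    print(f"n={n:5d}  k/n={k/n:.3f}  P-predictive={p:.3f}  Q-predictive={q:.3f}")
```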

Question: What parts of these asymptotic results for the algebra of events & depend upon the (shared) statistical model?

Answers:

(1) Asymptotic Certainty is automatic within the Bayesian framework! (∀E ∈ &) with P-probability 1, lim_n P_n(E) = I_E(x), the indicator of E; i.e., the posterior probability of E converges to 1 or to 0 according as E obtains or not.

(2) Asymptotic Consensus requires only agreement on null events. Assume that (∀E ∈ &) P(E) = 0 if and only if Q(E) = 0. Then, with P- (or Q-) probability 1, with respect to &,

lim_n U(P_n, Q_n) = 0.

However, the statistical model is needed for each of the following:
(1) data reduction
(2) rates of convergence to certainty
(3) rates of merging for Bayesian investigators with shared evidence

In the next lectures we will explore themes for Bayesian asymptotics: a role for statistical models in these asymptotic results.
o symmetry/independence assumptions in these results.
o data reduction.
o asymptotic Normal inference for these results.

A puzzlement?

We have two investigators (T and Z) for our coin-tossing problem. They share the same statistical (i.i.d. Bernoulli) model for coin flips, and they have the same (conjugate) Beta prior for θ. They collect (shared) evidence by flipping the coin until one says "Stop." In fact, they observe the sequence (H, H, T, H, T, T, H, H, T, H), at which point they both (simultaneously) say "Stop!"

However: T's plan was to flip the coin exactly 10 times and then stop, while Z's plan was to flip until there were 6 Heads and then stop.

Exercise: Give the Bayes analysis for T and for Z of these data.
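
One way to set up the computation, not the notes' worked solution (the shared Beta(1, 1) prior is an illustrative assumption): T's sampling plan gives a Binomial likelihood and Z's a Negative-Binomial likelihood, both proportional to θ^6 (1-θ)^4 for the observed sequence.

```python
from math import comb
from scipy import stats

# Observed data: k = 6 heads in n = 10 flips, with the 10th flip a head.
k, n = 6, 10
alpha, beta = 1.0, 1.0                               # shared prior (assumption)

# T's plan (flip exactly n times):        P_T(data | theta) = C(n, k) theta^k (1-theta)^(n-k)
# Z's plan (flip until the k-th head):    P_Z(data | theta) = C(n-1, k-1) theta^k (1-theta)^(n-k)
# The constants differ but do not involve theta, so under the shared Beta prior
# each investigator's posterior is Beta(alpha + k, beta + n - k).
posterior_T = stats.beta(alpha + k, beta + n - k)
posterior_Z = stats.beta(alpha + k, beta + n - k)
print(posterior_T.mean(), posterior_Z.mean())        # identical posterior means

print(comb(n, k), comb(n - 1, k - 1))                # the theta-free constants that differ
```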