STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01

Nasser Sadeghkhani, a.sadeghkhani@queensu.ca

There are two main schools of statistical inference: (1) the frequentist (or classical) school, and (2) the Bayesian school. Most of the methods you have seen so far are likely frequentist. It is important to understand both approaches.

Frequentist vs. Bayesian Methods

In frequentist inference, probabilities are interpreted as long-run frequencies, and the goal is to create procedures with long-run frequency guarantees. In Bayesian inference, probabilities are interpreted as subjective degrees of belief, and the goal is to state and analyze those beliefs. Some differences between the Bayesian and frequentist (non-Bayesian) approaches are as follows:

- Frequentist: the unknown parameter $\theta$ is a fixed constant; probability statements refer to the sampling distribution of the data; procedures are judged by their repeated-sampling properties (coverage, error rates).
- Bayesian: the unknown parameter $\theta$ is assigned a prior distribution; probability statements express degrees of belief about $\theta$ given the observed data; inference is based on the posterior distribution.

To illustrate the difference, consider the following example.

Example 0.1. Assume $X_i$, $i = 1, \ldots, n$, are a random sample from $N(\theta, 1)$. We know that a 95% confidence interval (CI) for $\theta$ is given by
$$P\big(\theta \in \big[\bar{X} - 1.96/\sqrt{n},\ \bar{X} + 1.96/\sqrt{n}\big]\big) = 0.95. \qquad (0.1)$$
In fact, the unknown parameter $\theta$ is a fixed quantity and the interval is random because it is a function of the data. Equation (0.1) means that the interval $[\bar{X} - 1.96/\sqrt{n},\ \bar{X} + 1.96/\sqrt{n}]$ will trap the true value $\theta$ with probability 0.95.

In contrast, the Bayesian treats probability as belief, not frequency. The unknown parameter $\theta$ is given a prior distribution, say $\pi(\theta)$, representing our subjective beliefs about $\theta$. After observing the $X_i$'s, we update our beliefs and compute the posterior distribution $\pi(\theta \mid x_1, \ldots, x_n)$ (we will see how later on). One can then calculate and report
$$P\big(\theta \in \big[\bar{X} - 1.96/\sqrt{n},\ \bar{X} + 1.96/\sqrt{n}\big] \,\big|\, x_1, \ldots, x_n\big) = 0.95. \qquad (0.2)$$
Note that the probability in equation (0.2) is a degree-of-belief statement about the unknown parameter $\theta$ given the $X_i$'s, and it is not the same as equation (0.1). That is, if we repeated this experiment many times, the intervals would not necessarily trap the true value 95 percent of the time.[1]

Example 0.2. Let $\theta$ be the probability of a particular coin landing on heads, and suppose we want to test the hypotheses (with $\alpha = 0.05$)
$$H_0: \theta = \tfrac{1}{2} \quad \text{vs.} \quad H_1: \theta > \tfrac{1}{2},$$
where the following sequence of flips has been observed: $\{H, H, H, H, H, T\}$.[2]

To perform a frequentist hypothesis test, we must define a random variable to describe the data. The proper way to do this depends on exactly which of the following two experiments was actually performed:

(a) Suppose the experiment was "flip six times and record the results." In this case, the random variable $X$ counts the number of heads, so $X \sim \mathrm{Bin}(6, \theta)$. Here $x = 5$, and the p-value is $P(X \geq 5 \mid \theta = \tfrac{1}{2}) \approx 0.11$; since this is not less than $\alpha = 0.05$, we do not reject $H_0$.

[1] We will see that an interval satisfying equation (0.2) is called a credible set (interval).
[2] H: head, T: tail.
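Returning to Example 0.1: the following Python sketch (an illustration added to these notes, not part of the original handout) simulates repeated samples from $N(\theta, 1)$ and checks how often the interval in (0.1) traps the true value. The true value $\theta = 2$ and the sample size $n = 25$ are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, reps = 2.0, 25, 10_000             # true theta and n chosen only for illustration
    covered = 0
    for _ in range(reps):
        x = rng.normal(theta, 1.0, size=n)       # X_1, ..., X_n ~ N(theta, 1)
        xbar = x.mean()
        lo, hi = xbar - 1.96 / np.sqrt(n), xbar + 1.96 / np.sqrt(n)
        covered += (lo <= theta <= hi)
    print(covered / reps)                        # close to 0.95: the long-run guarantee in (0.1)

The 0.95 is a property of the procedure over many repetitions of the experiment, not a probability statement about $\theta$ on any single data set.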

(b) In contrast, suppose the experiment was "flip until we get a tail." In this case, $X$ counts the number of flips until the first tail occurs, so $X \sim \mathrm{Geo}(1 - \theta)$, and the p-value is $P(X \geq 6 \mid \theta = \tfrac{1}{2}) \approx 0.031$; since this is less than $\alpha = 0.05$, we reject $H_0$.

The conclusions are different! In fact, the result of the hypothesis test depends on whether we would have stopped flipping if we had gotten a tail sooner. Note that, despite the different results, the likelihood function for the observed value of $x$ is the same for both experiments in (a) and (b) (up to a constant):[3]
$$P(x \mid \theta) \propto \theta^5 (1 - \theta).$$
In a Bayesian approach we take the data into account only through this likelihood, and therefore we are guaranteed to obtain the same answer regardless of which experiment was performed.

Bayesian methods are widespread in statistics, especially in some applied areas, often for computational rather than philosophical reasons. Furthermore, many modern techniques in machine learning, data science, and neural networks build on Bayes' ideas.

Bayes theorem

The word "Bayesian" dates back to the 18th century and the English Reverend Thomas Bayes, who, along with Pierre-Simon Laplace, was among the first thinkers to consider the laws of chance and randomness in a quantitative, scientific way. Both Bayes and Laplace were aware of the relation that is now known as Bayes' theorem:
$$P(\theta \mid y) = \frac{P(y \mid \theta)\, P(\theta)}{P(y)} = \frac{P(y \mid \theta)\, P(\theta)}{\int P(y \mid \theta)\, P(\theta)\, d\theta}.$$

[3] It is often said that both have the same kernel.
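As a numerical footnote to Example 0.2 and to Bayes' theorem, here is a short Python sketch added to these notes. It reproduces the two frequentist p-values and then performs the Bayesian update under a uniform prior (my choice of prior, for illustration only), showing that the posterior, and hence any Bayesian answer, is identical for the two experiments.

    from scipy.stats import binom, geom, beta

    # Frequentist p-values for the same observed flips {H, H, H, H, H, T}
    p_a = binom.sf(4, 6, 0.5)      # (a) X ~ Bin(6, 1/2): P(X >= 5) = 7/64, about 0.109
    p_b = geom.sf(5, 0.5)          # (b) X ~ Geo(1/2) counting flips: P(X >= 6) = (1/2)^5, about 0.031
    print(p_a, p_b)                # different conclusions at alpha = 0.05

    # Bayesian update with a uniform prior theta ~ U(0, 1): under either experiment the
    # likelihood has kernel theta^5 (1 - theta), so the posterior is Beta(1 + 5, 1 + 1)
    post = beta(6, 2)
    print(1 - post.cdf(0.5))       # P(theta > 1/2 | data), about 0.94, the same for (a) and (b)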

Therefore, the Bayesian method comprises the following principal steps:

(1) Prior: Obtain the prior density $P(\theta)$ (also written $\pi(\theta)$), which expresses our knowledge about $\theta$ prior to observing the data.

(2) Likelihood: Obtain the likelihood function $P(y \mid \theta)$ (or $L(\theta; y)$). This step simply describes the process giving rise to the data $y$ in terms of $\theta$.

(3) Posterior: Apply Bayes' theorem to derive the posterior density $P(\theta \mid y)$, which expresses all that is known about $\theta$ after observing the data $y$.

(4) Inference: Derive appropriate inference statements from the posterior distribution, e.g. point estimates, interval estimates, and probabilities of specified hypotheses.

(A numerical sketch of these four steps is given after the list of applications below.)

Miscellaneous applications

Bayesian statistics is applied in a large variety of different fields, including:

- Economics (econometrics), to make decisions that optimize benefit under uncertainty.
- Biostatistics, where it enables experts to use their knowledge in the inference.
- Machine learning, which often uses nonparametric Bayesian models that adaptively become more complex as more data become available (we focus on parametric models in this course).
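The following Python sketch, added to these notes, walks through the four steps on a toy normal-mean problem. The data values, the $N(0, 2^2)$ prior, and the grid approximation of the posterior are all illustrative choices, not part of the handout.

    import numpy as np
    from scipy.stats import norm

    y = np.array([1.2, 0.7, 1.9, 1.1, 0.4])          # hypothetical observations

    # (1) Prior: theta ~ N(0, 2^2), evaluated on a grid of theta values
    grid = np.linspace(-5.0, 5.0, 2001)
    dx = grid[1] - grid[0]
    prior = norm.pdf(grid, loc=0.0, scale=2.0)

    # (2) Likelihood: y_i | theta ~ N(theta, 1), independent
    like = np.prod(norm.pdf(y[:, None], loc=grid, scale=1.0), axis=0)

    # (3) Posterior: prior times likelihood, normalized numerically (Bayes' theorem)
    post = prior * like
    post /= post.sum() * dx

    # (4) Inference: posterior mean and an equal-tail 95% credible interval
    post_mean = (grid * post).sum() * dx
    cdf = np.cumsum(post) * dx
    lo, hi = grid[np.searchsorted(cdf, 0.025)], grid[np.searchsorted(cdf, 0.975)]
    print(post_mean, (lo, hi))

With a conjugate prior (as in Examples 0.3 and 0.4 below) the posterior is available in closed form, but the grid version makes the prior-likelihood-posterior-inference pipeline explicit.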

Advantages and Disadvantages of being Bayesian

Some advantages:

- Bayesian logic and interpretation are simple; scientific questions can often be easily framed as inferential questions.
- Bayesian inference is simple in principle and provides a single recipe for coherent inference, all based on the posterior.
- Prior information can be used, allowing one to combine various sources of information, including constraints.
- Bayesian inference naturally deals with conditioning, marginalization, and nuisance parameters.
- Parameter uncertainty is naturally accounted for.
- Bayesian inference naturally meshes with decision theory.
- Modern computational techniques allow models to be fit under a Bayesian approach that cannot be fit in other ways.
- Bayesian results often have good frequentist properties, and frequentist inference is sometimes a special case of Bayesian results under a particular prior.
- Complicated hierarchical models can be naturally constructed in a Bayesian framework.
- Bayesian inference naturally penalizes complex models.
- Bayesian inference can deal with multiple testing inherently, if set up properly as a joint inference problem.

Some disadvantages:

- Computing the posterior, while simple in theory, is often difficult and time consuming in practice.
- Bayesian inference is model based, and some classical methods and models do not carry over easily (partial likelihood, nonparametric testing, robust estimation, marginal models).
- Sensitivity to prior selection: the posterior may be heavily influenced by the prior (informative prior + small data size).
- High computational cost.
- Simulation-based methods give slightly different answers on each run.

- There is no guarantee of Markov chain Monte Carlo (MCMC) convergence.

In brief, Bayesian statistics may be preferable to frequentist statistics when the researcher wants to combine knowledge modeling (information from experts, or pre-existing information) with knowledge discovery (data, evidence) to help with decision support (analytics, simulation, diagnosis, and optimization) and risk management.

We now try to motivate the use of priors on parameters, and indeed to motivate the very use of parameters.

Definition 0.1. (Infinite exchangeability) We say that a sequence of random variables $y_1, y_2, \ldots$ is infinitely exchangeable if, for any $n$, the joint probability $p(y_1, \ldots, y_n)$ is invariant to permutation of the indices. That is, for any permutation $\pi$,
$$p(y_1, \ldots, y_n) = p(y_{\pi_1}, \ldots, y_{\pi_n}).$$

A key assumption of many statistical analyses is that the random variables being studied are independent and identically distributed (iid). Note that iid random variables are always infinitely exchangeable. However, the converse is not necessarily true. For example, let $y_1, y_2, \ldots$ be iid, and let $y_0$ be a non-trivial random variable independent of the rest. Then $y_0 + y_1, y_0 + y_2, \ldots$ is infinitely exchangeable but not iid. The strength of infinite exchangeability lies in the following theorem.

Theorem 0.1. (De Finetti) A sequence of random variables $y_1, y_2, \ldots$ is infinitely exchangeable iff, for all $n$,
$$p(y_1, \ldots, y_n) = \int \prod_{i=1}^{n} P(y_i \mid \theta)\, P(d\theta),$$
for some measure $P$ on $\theta$.

Clearly, since the product $\prod_{i=1}^{n} P(y_i \mid \theta)$ is invariant to reordering, any sequence whose distribution can be written as $\int \prod_{i=1}^{n} P(y_i \mid \theta)\, P(d\theta)$ for all $n$ must be (infinitely) exchangeable. The other direction, though, is much deeper. It says that if we have exchangeable data, then:

- There must exist a parameter $\theta$.
- There must exist a likelihood $P(y \mid \theta)$.
- There must exist a distribution $P$ on $\theta$.

Thus, the theorem provides an answer to the questions of why we should use parameters and why we should put priors on parameters.
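To see De Finetti's representation in action, the following Python sketch (added for illustration; the uniform mixing distribution is an arbitrary choice) generates sequences by first drawing $\theta$ from $P(d\theta)$ and then drawing $y_1, \ldots, y_n$ iid Bernoulli($\theta$) given $\theta$. The resulting $y_i$ are exchangeable and identically distributed, but they are positively correlated marginally, so they are not independent.

    import numpy as np

    rng = np.random.default_rng(1)
    reps, n = 100_000, 5

    # De Finetti-style construction: theta ~ P(dtheta), then y_i | theta iid Bernoulli(theta)
    theta = rng.uniform(0.0, 1.0, size=reps)              # here P(dtheta) is Uniform(0, 1)
    y = rng.binomial(1, theta[:, None], size=(reps, n))

    # Every pair (y_i, y_j), i != j, has the same positive correlation (about 1/3 here),
    # which is consistent with exchangeability but rules out independence
    print(np.corrcoef(y[:, 0], y[:, 1])[0, 1])
    print(np.corrcoef(y[:, 0], y[:, 4])[0, 1])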

Example 0.3. (Bayes 1764) A billiard ball W rolls on a line of length one, with uniform probability of stopping anywhere; it stops at $\theta$, i.e. $\theta \sim U(0, 1)$. A second ball O is then rolled $n$ times under the same conditions, and $Y$ denotes the number of times O stopped to the left of W. What is the posterior of $\theta$ given $y$?
$$\pi(\theta \mid y) \propto \pi(\theta)\, P(y \mid \theta) \propto \theta^{y} (1 - \theta)^{n - y},$$
therefore $\theta \mid y \sim \mathrm{Beta}(y + 1,\, n - y + 1)$ and $E(\theta \mid y) = \frac{y + 1}{n + 2}$. It is also easy to show that the maximum a posteriori (MAP)[4] and maximum likelihood (ML) estimators are $\hat{\theta}_{\mathrm{MAP}} = \hat{\theta}_{\mathrm{ML}} = \frac{y}{n}$.

Example 0.4. If $Y \sim \mathrm{Bin}(n, \theta)$ and $\theta \sim \mathrm{Beta}(\alpha, \beta)$ ($\alpha = \beta = 1$ being the particular case of Example 0.3), then $\theta \mid y \sim \mathrm{Beta}(y + \alpha,\, n - y + \beta)$.

Remark 0.1. The Bayesian approach enjoys a specific kind of coherence: not only does the order in which i.i.d. observations are collected not matter, but updating the prior one observation at a time, or with all observations together, gives the same posterior. In other words,
$$\pi(\theta \mid y_1, \ldots, y_n) = \frac{P(y_n \mid \theta)\, \pi(\theta \mid y_1, \ldots, y_{n-1})}{\int P(y_n \mid \theta)\, \pi(\theta \mid y_1, \ldots, y_{n-1})\, d\theta} = \frac{P(y_n \mid \theta)\, P(y_{n-1} \mid \theta)\, \pi(\theta \mid y_1, \ldots, y_{n-2})}{\int P(y_n \mid \theta)\, P(y_{n-1} \mid \theta)\, \pi(\theta \mid y_1, \ldots, y_{n-2})\, d\theta} = \cdots = \frac{P(y_n \mid \theta)\, P(y_{n-1} \mid \theta) \cdots P(y_1 \mid \theta)\, \pi(\theta)}{\int P(y_n \mid \theta)\, P(y_{n-1} \mid \theta) \cdots P(y_1 \mid \theta)\, \pi(\theta)\, d\theta}.$$

[4] $\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\, \pi(\theta \mid y)$.
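The following Python sketch, added to these notes, illustrates Examples 0.3 and 0.4 and Remark 0.1; the values $n = 10$, $y = 7$ and the particular flip sequence are hypothetical choices for illustration.

    import numpy as np
    from scipy.stats import beta

    # Examples 0.3 / 0.4: y successes out of n, prior theta ~ Beta(alpha, beta); alpha = beta = 1 is uniform
    n, y, a, b = 10, 7, 1.0, 1.0
    post = beta(y + a, n - y + b)        # theta | y ~ Beta(y + alpha, n - y + beta) = Beta(8, 4)
    print(post.mean())                   # posterior mean (y + 1)/(n + 2) = 8/12, about 0.667
    print(y / n)                         # MAP = ML = y/n = 0.7 under the uniform prior

    # Remark 0.1: updating one observation at a time gives the same posterior
    flips = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])   # hypothetical sequence with 7 successes
    a_seq, b_seq = a, b
    for f in flips:
        a_seq, b_seq = a_seq + f, b_seq + (1 - f)      # each Beta prior updates to a Beta posterior
    print(a_seq, b_seq)                  # 8.0 4.0: identical to the all-at-once Beta(8, 4)

The End.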