Bayesian Methods. David S. Rosenberg. New York University. DS-GA 1003 / CSCI-GA 2567. March 20, 2018



Contents

1. Classical Statistics
2. Bayesian Statistics: Introduction
3. Bayesian Decision Theory
4. Summary

Classical Statistics

Parametric Family of Densities

A parametric family of densities is a set $\{p(y \mid \theta) : \theta \in \Theta\}$, where $p(y \mid \theta)$ is a density on a sample space $\mathcal{Y}$, and $\theta$ is a parameter in a [finite-dimensional] parameter space $\Theta$. This is the common starting point for a treatment of classical or Bayesian statistics.

Density vs Mass Functions

In this lecture, whenever we say "density," we could replace it with "mass function." Corresponding integrals would be replaced by summations. (In more advanced, measure-theoretic treatments, they are each considered densities w.r.t. different base measures.)

Frequentist or Classical Statistics

Parametric family of densities $\{p(y \mid \theta) : \theta \in \Theta\}$. We assume that $p(y \mid \theta)$ governs the world we are observing, for some $\theta \in \Theta$. If we knew the right $\theta \in \Theta$, there would be no need for statistics. Instead of $\theta$, we have data $D$: $y_1, \ldots, y_n$ sampled i.i.d. from $p(y \mid \theta)$. Statistics is about how to get by with $D$ in place of $\theta$.

Point Estimation

One type of statistical problem is point estimation. A statistic $s = s(D)$ is any function of the data. A statistic $\hat\theta = \hat\theta(D)$ taking values in $\Theta$ is a point estimator of $\theta$. A good point estimator will have $\hat\theta \approx \theta$.

Desirable Properties of Point Estimators

Desirable statistical properties of point estimators:
Consistency: as the sample size $n \to \infty$, we get $\hat\theta_n \to \theta$.
Efficiency: (roughly speaking) $\hat\theta_n$ is as accurate as we can get from a sample of size $n$.
Maximum likelihood estimators are consistent and efficient under reasonable conditions.

The Likelihood Function

Consider a parametric family $\{p(y \mid \theta) : \theta \in \Theta\}$ and an i.i.d. sample $D = (y_1, \ldots, y_n)$. The density of the sample $D$ for $\theta \in \Theta$ is
$$p(D \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta).$$
$p(D \mid \theta)$ is a function of $D$ and $\theta$. For fixed $\theta$, $p(D \mid \theta)$ is a density function on $\mathcal{Y}^n$. For fixed $D$, the function $\theta \mapsto p(D \mid \theta)$ is called the likelihood function:
$$L_D(\theta) := p(D \mid \theta).$$

Maximum Likelihood Estimation

Definition. The maximum likelihood estimator (MLE) for $\theta$ in the model $\{p(y \mid \theta) : \theta \in \Theta\}$ is
$$\hat\theta_{\text{MLE}} = \arg\max_{\theta \in \Theta} L_D(\theta).$$
Maximum likelihood is just one approach to getting a point estimator for $\theta$. Method of moments is another general approach one learns about in statistics. Later we'll talk about MAP and the posterior mean as approaches to point estimation. These arise naturally in Bayesian settings.

Coin Flipping: Setup

Parametric family of mass functions: $p(\text{heads} \mid \theta) = \theta$, for $\theta \in \Theta = (0, 1)$. Note that every $\theta \in \Theta$ gives us a different probability model for a coin.

Coin Flipping: Likelihood Function

Data $D = (H, H, T, T, T, T, T, H, \ldots, T)$, where $n_h$ is the number of heads and $n_t$ is the number of tails. Assume these were i.i.d. flips. The likelihood function for data $D$ is
$$L_D(\theta) = p(D \mid \theta) = \theta^{n_h} (1 - \theta)^{n_t}.$$
This is the probability of getting the flips in the order they were received.

Coin Flipping: MLE

As usual, it is easier to maximize the log-likelihood function:
$$\hat\theta_{\text{MLE}} = \arg\max_{\theta \in \Theta} \log L_D(\theta) = \arg\max_{\theta \in \Theta} \left[ n_h \log\theta + n_t \log(1 - \theta) \right].$$
The first-order condition gives
$$\frac{n_h}{\theta} - \frac{n_t}{1 - \theta} = 0 \;\Longleftrightarrow\; \theta = \frac{n_h}{n_h + n_t}.$$
So $\hat\theta_{\text{MLE}}$ is the empirical fraction of heads.
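To make the derivation concrete, here is a minimal Python sketch (not from the slides; the flip sequence is made up) that computes the MLE as the empirical fraction of heads:

```python
# Hypothetical coin-flip data; "H" = heads, "T" = tails.
flips = ["H", "H", "T", "T", "T", "T", "T", "H", "T", "H"]

n_h = sum(1 for f in flips if f == "H")   # number of heads
n_t = len(flips) - n_h                    # number of tails

theta_mle = n_h / (n_h + n_t)             # MLE = empirical fraction of heads
print(f"n_h = {n_h}, n_t = {n_t}, theta_MLE = {theta_mle:.3f}")
```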

Bayesian Statistics: Introduction

Bayesian Statistics

Bayesian statistics introduces a new ingredient: the prior distribution. A prior distribution $p(\theta)$ is a distribution on the parameter space $\Theta$. A prior reflects our belief about $\theta$ before seeing any data.

A Bayesian Model

A [parametric] Bayesian model consists of two pieces:
1. A parametric family of densities $\{p(D \mid \theta) : \theta \in \Theta\}$.
2. A prior distribution $p(\theta)$ on the parameter space $\Theta$.
Putting the pieces together, we get a joint density on $\theta$ and $D$:
$$p(D, \theta) = p(D \mid \theta)\, p(\theta).$$

The Posterior Distribution

The posterior distribution for $\theta$ is $p(\theta \mid D)$. The prior represents belief about $\theta$ before observing data $D$. The posterior represents the rationally updated belief about $\theta$ after seeing $D$.

Expressing the Posterior Distribution

By Bayes' rule, we can write the posterior distribution as
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}.$$
Let's consider both sides as functions of $\theta$, for fixed $D$. Then both sides are densities on $\Theta$ and we can write
$$\underbrace{p(\theta \mid D)}_{\text{posterior}} \;\propto\; \underbrace{p(D \mid \theta)}_{\text{likelihood}}\; \underbrace{p(\theta)}_{\text{prior}},$$
where $\propto$ means we've dropped factors independent of $\theta$.
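As a sanity check on the proportionality statement, here is a small Python sketch (not from the slides; the flat prior and the counts are made up for illustration) that approximates the posterior on a grid by multiplying prior and likelihood and then normalizing, i.e., dividing by an approximation of $p(D)$:

```python
import numpy as np

theta = np.linspace(1e-3, 1 - 1e-3, 999)      # grid over Theta = (0, 1)
prior = np.ones_like(theta)                    # flat prior, for illustration
n_h, n_t = 3, 7                                # made-up coin-flip counts
likelihood = theta**n_h * (1 - theta)**n_t     # L_D(theta)

unnormalized = likelihood * prior              # p(D | theta) p(theta)
p_D = np.trapz(unnormalized, theta)            # normalizing constant p(D)
posterior = unnormalized / p_D                 # p(theta | D)

print("posterior integrates to", np.trapz(posterior, theta))       # ~ 1.0
print("posterior mean approx.", np.trapz(theta * posterior, theta))
```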

Coin Flipping: Bayesian Model

Parametric family of mass functions: $p(\text{heads} \mid \theta) = \theta$, for $\theta \in \Theta = (0, 1)$. We need a prior distribution $p(\theta)$ on $\Theta = (0, 1)$. A distribution from the Beta family will do the trick...

Coin Flipping: Beta Prior

Prior: $\theta \sim \text{Beta}(\alpha, \beta)$, with
$$p(\theta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}.$$
[Figure: Beta distribution PDFs. Figure by Horas, based on the work of Krishnavedala (own work), public domain, via Wikimedia Commons: http://commons.wikimedia.org/wiki/file:beta_distribution_pdf.svg]

Coin Flipping: Beta Prior

Prior: $\theta \sim \text{Beta}(h, t)$, with
$$p(\theta) \propto \theta^{h - 1} (1 - \theta)^{t - 1}.$$
Mean of the Beta distribution:
$$\mathbb{E}\theta = \frac{h}{h + t}.$$
Mode of the Beta distribution:
$$\arg\max_{\theta} p(\theta) = \frac{h - 1}{h + t - 2} \quad \text{for } h, t > 1.$$

Coin Flipping: Posterior

Prior: $\theta \sim \text{Beta}(h, t)$, with $p(\theta) \propto \theta^{h-1}(1-\theta)^{t-1}$. Likelihood function: $L(\theta) = p(D \mid \theta) = \theta^{n_h}(1-\theta)^{n_t}$. Posterior density:
$$p(\theta \mid D) \;\propto\; p(\theta)\, p(D \mid \theta) \;\propto\; \theta^{h-1}(1-\theta)^{t-1}\, \theta^{n_h}(1-\theta)^{n_t} \;=\; \theta^{h-1+n_h}(1-\theta)^{t-1+n_t}.$$

Posterior is Beta

Prior: $\theta \sim \text{Beta}(h, t)$, with $p(\theta) \propto \theta^{h-1}(1-\theta)^{t-1}$. Posterior density: $p(\theta \mid D) \propto \theta^{h-1+n_h}(1-\theta)^{t-1+n_t}$, so the posterior is in the Beta family:
$$\theta \mid D \sim \text{Beta}(h + n_h,\; t + n_t).$$
Interpretation: the prior initializes our counts with $h$ heads and $t$ tails; the posterior increments the counts by the observed $n_h$ and $n_t$.
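The conjugate update is a one-liner in code. A minimal sketch (not from the slides; the helper name and counts are made up), assuming SciPy is available:

```python
from scipy import stats

def beta_posterior(h, t, n_h, n_t):
    """Beta(h, t) prior plus (n_h heads, n_t tails) gives a Beta(h + n_h, t + n_t) posterior."""
    return stats.beta(h + n_h, t + n_t)

post = beta_posterior(h=2, t=2, n_h=3, n_t=7)   # made-up counts
print("posterior mean:", post.mean())            # (h + n_h) / (h + t + n_h + n_t)
```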

Sidebar: Conjugate Priors

It is interesting that the posterior is in the same distribution family as the prior. Let $\pi$ be a family of prior distributions on $\Theta$, and let $\mathcal{P}$ be a parametric family of distributions with parameter space $\Theta$.

Definition. A family of distributions $\pi$ is conjugate to a parametric model $\mathcal{P}$ if for any prior in $\pi$, the posterior is always in $\pi$.

The Beta family is conjugate to the coin-flipping (i.e., Bernoulli) model. The family of all probability distributions is [trivially] conjugate to any parametric model.

Example: Coin Flipping - Concrete Example

Suppose we have a coin, possibly biased (parametric probability model): $p(\text{heads} \mid \theta) = \theta$, with parameter space $\theta \in \Theta = [0, 1]$. Prior distribution: $\theta \sim \text{Beta}(2, 2)$.

Example: Coin Flipping

Next, we gather some data $D = \{H, H, T, T, T, T, T, H, \ldots, T\}$: 75 heads and 60 tails.
$$\hat\theta_{\text{MLE}} = \frac{75}{75 + 60} \approx 0.556.$$
Posterior distribution: $\theta \mid D \sim \text{Beta}(77, 62)$.

Bayesian Point Estimates

So we have the posterior $\theta \mid D$... but we want a point estimate $\hat\theta$ for $\theta$. Common options:
the posterior mean $\hat\theta = \mathbb{E}[\theta \mid D]$, and
the maximum a posteriori (MAP) estimate $\hat\theta = \arg\max_{\theta} p(\theta \mid D)$ (note: this is the mode of the posterior distribution).
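Plugging in the numbers from the concrete example above (Beta(2, 2) prior, 75 heads, 60 tails), here is a short sketch with SciPy comparing the MLE, the posterior mean, and the MAP estimate (an illustration, not part of the slides):

```python
from scipy import stats

a, b = 2 + 75, 2 + 60                      # posterior Beta(77, 62)
posterior = stats.beta(a, b)

theta_mle = 75 / (75 + 60)                 # ~ 0.556
posterior_mean = posterior.mean()          # 77 / (77 + 62) ~ 0.554
theta_map = (a - 1) / (a + b - 2)          # mode of Beta(77, 62) ~ 0.555

print(theta_mle, posterior_mean, theta_map)
```

With this much data, all three point estimates are nearly identical; the Beta(2, 2) prior only adds two pseudo-counts of each type.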

What else can we do with a posterior?

Look at it. Extract a credible set for $\theta$ (the Bayesian version of a confidence interval); e.g., an interval $[a, b]$ is a 95% credible set if
$$P(\theta \in [a, b] \mid D) \ge 0.95.$$
The most Bayesian approach is Bayesian decision theory: choose a loss function, then find the action minimizing the expected loss w.r.t. the posterior.
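For example, a 95% equal-tailed credible interval for the Beta(77, 62) posterior above can be read off from the posterior quantiles. A sketch, assuming SciPy; the equal-tailed construction is just one convention:

```python
from scipy import stats

posterior = stats.beta(77, 62)
lo, hi = posterior.ppf(0.025), posterior.ppf(0.975)   # 2.5% and 97.5% quantiles
print(f"P(theta in [{lo:.3f}, {hi:.3f}] | D) = 0.95")
```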

Bayesian Decision Theory

Bayesian Decision Theory

Ingredients:
Parameter space $\Theta$.
Prior: a distribution $p(\theta)$ on $\Theta$.
Action space $\mathcal{A}$.
Loss function: $\ell : \mathcal{A} \times \Theta \to \mathbb{R}$.
The posterior risk of an action $a \in \mathcal{A}$ is
$$r(a) := \mathbb{E}[\ell(\theta, a) \mid D] = \int \ell(\theta, a)\, p(\theta \mid D)\, d\theta.$$
It is the expected loss under the posterior. A Bayes action $a^*$ is an action that minimizes the posterior risk:
$$r(a^*) = \min_{a \in \mathcal{A}} r(a).$$
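A numerical sketch of these definitions (not from the slides): approximate the posterior risk $r(a)$ on a grid of candidate actions and take the minimizer. With square loss, the result should land near the posterior mean, matching the derivation on the following slides.

```python
import numpy as np
from scipy import stats

posterior = stats.beta(77, 62)                    # posterior from the coin example
theta = np.linspace(1e-4, 1 - 1e-4, 2000)         # grid over Theta
p_theta = posterior.pdf(theta)

def posterior_risk(a, loss):
    # r(a) = E[loss(theta, a) | D], approximated by numerical integration
    return np.trapz(loss(theta, a) * p_theta, theta)

square_loss = lambda th, a: (th - a) ** 2
actions = np.linspace(0.0, 1.0, 1001)             # candidate actions
risks = [posterior_risk(a, square_loss) for a in actions]
bayes_action = actions[int(np.argmin(risks))]

print("Bayes action:", bayes_action, " posterior mean:", posterior.mean())
```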

Bayesian Point Estimation

General setup: data $D$ is generated by $p(y \mid \theta)$, for unknown $\theta \in \Theta$, and we want to produce a point estimate for $\theta$. Choose the following:
Prior $p(\theta)$ on $\Theta = \mathbb{R}$.
Loss $\ell(\hat\theta, \theta) = (\theta - \hat\theta)^2$.
Find the action $\hat\theta \in \Theta$ that minimizes the posterior risk:
$$r(\hat\theta) = \mathbb{E}\left[ (\theta - \hat\theta)^2 \mid D \right] = \int (\theta - \hat\theta)^2\, p(\theta \mid D)\, d\theta.$$

Bayesian Point Estimation: Square Loss

Find the action $\hat\theta \in \Theta$ that minimizes the posterior risk
$$r(\hat\theta) = \int (\theta - \hat\theta)^2\, p(\theta \mid D)\, d\theta.$$
Differentiate:
$$\frac{d r(\hat\theta)}{d\hat\theta} = -2 \int (\theta - \hat\theta)\, p(\theta \mid D)\, d\theta = -2 \int \theta\, p(\theta \mid D)\, d\theta + 2\hat\theta \underbrace{\int p(\theta \mid D)\, d\theta}_{=1} = -2 \int \theta\, p(\theta \mid D)\, d\theta + 2\hat\theta.$$

Bayesian Point Estimation: Square Loss

The derivative of the posterior risk is
$$\frac{d r(\hat\theta)}{d\hat\theta} = -2 \int \theta\, p(\theta \mid D)\, d\theta + 2\hat\theta.$$
The first-order condition $\frac{d r(\hat\theta)}{d\hat\theta} = 0$ gives
$$\hat\theta = \int \theta\, p(\theta \mid D)\, d\theta = \mathbb{E}[\theta \mid D].$$
The Bayes action for square loss is the posterior mean.

Bayesian Point Estimation: Absolute Loss

Loss: $\ell(\theta, \hat\theta) = |\theta - \hat\theta|$. The Bayes action for absolute loss is the posterior median, that is, the median of the distribution $p(\theta \mid D)$. Show this with an approach similar to what was used in Homework #1.
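A sketch of that argument (not spelled out on the slide), assuming the posterior has a density and we may differentiate under the integral sign:

```latex
r(\hat\theta)
  = \int_{-\infty}^{\hat\theta} (\hat\theta - \theta)\, p(\theta \mid D)\, d\theta
  + \int_{\hat\theta}^{\infty} (\theta - \hat\theta)\, p(\theta \mid D)\, d\theta,
\qquad
\frac{d r(\hat\theta)}{d\hat\theta}
  = P(\theta \le \hat\theta \mid D) - P(\theta > \hat\theta \mid D).
```

Setting the derivative to zero gives $P(\theta \le \hat\theta \mid D) = 1/2$, i.e., $\hat\theta$ is a median of the posterior.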

Bayesian Point Estimation: Zero-One Loss

Suppose $\Theta$ is discrete (e.g., $\Theta = \{\text{english}, \text{french}\}$). Zero-one loss: $\ell(\theta, \hat\theta) = 1(\theta \ne \hat\theta)$. Posterior risk:
$$r(\hat\theta) = \mathbb{E}\left[ 1(\theta \ne \hat\theta) \mid D \right] = P(\theta \ne \hat\theta \mid D) = 1 - P(\theta = \hat\theta \mid D) = 1 - p(\hat\theta \mid D).$$
The Bayes action is
$$\hat\theta = \arg\max_{\theta \in \Theta} p(\theta \mid D).$$
This $\hat\theta$ is called the maximum a posteriori (MAP) estimate. The MAP estimate is the mode of the posterior distribution.

Summary

Recap and Interpretation

The prior represents belief about $\theta$ before observing data $D$. The posterior represents the rationally updated beliefs after seeing $D$. All inferences and action-taking are based on the posterior distribution. In the Bayesian approach, there is no issue of choosing a procedure or justifying an estimator: the only choices are the family of distributions, indexed by $\Theta$, and the prior distribution on $\Theta$. For decision making, we also need a loss function. Everything after that is computation.

The Bayesian Method

1. Define the model: choose a parametric family of densities $\{p(D \mid \theta) : \theta \in \Theta\}$ and a distribution $p(\theta)$ on $\Theta$, called the prior distribution.
2. After observing $D$, compute the posterior distribution $p(\theta \mid D)$.
3. Choose an action based on $p(\theta \mid D)$.