Intro to Bayesian Methods

Intro to Bayesian Methods
Rebecca C. Steorts
Bayesian Methods and Modern Statistics: STA 360/601
Lecture 1

Course Webpage: syllabus, LaTeX reference manual, R Markdown reference manual. Please come to office hours for all questions; office hours are not a review period if you cannot come to class. Join the Google group. You are graded on labs/HWs and exams. Labs/HWs must be in R Markdown format (it must compile). Nothing late will be accepted. Your lowest homework will be dropped. Announcements: by email or in class. All your lab/homework assignments will be uploaded to Sakai. How to reach me and the TAs: email or the Google group.

Expectations: Class is optional, but you are expected to know everything covered in lecture; not everything will always be on the slides. 2 exams: in class, timed, closed book, closed notes (dates are on the syllabus). There are NO make-up exams. Late assignments will not be accepted; don't ask. Final exam: during finals week. You should be reading the book as we go through the material in class.

Expectations for Homework: Your write-ups should be clearly written. Proofs: show all details. Data analysis: explain clearly; don't just turn in code, and code must be well documented. Code style: https://google.github.io/styleguide/rguide.xml. For all homeworks you can use R Markdown or LaTeX. You must include all files that lead to your solutions (this includes code)!

Things available to you! Come to office hours; we want to help you learn! Supplementary reading to go with the notes, by yours truly (beware of typos): undergrad-level notes and PhD-level notes. An example write-up in .Rmd is on Sakai (Module 0). You should have your homeworks graded and returned by the TAs within one week!

Why should we learn about Bayesian concepts? They are natural if we think of unknown parameters as random. A Bayesian update naturally gives a full distribution over the parameter, so we automatically get uncertainty quantification. Drawbacks: Bayesian methods can be slow, and they can be inconsistent.

Record linkage

Record linkage is the process of merging together noisy databases to remove duplicate entries.


[Example records shown as images.] These are clearly not the same Steve Fienberg!

Syrian Civil War

Bayesian Model

Define α_l(w) = the relative frequency of w in the data for field l.
G_l: the empirical distribution for field l.
W ~ F_l(w_0): P(W = w) ∝ α_l(w) exp[−c · d(w, w_0)], where d(·, ·) is a string metric and c > 0.

X_ijl | λ_ij, Y_{λ_ij l}, z_ijl ~ δ(Y_{λ_ij l}) if z_ijl = 0,
                                  F_l(Y_{λ_ij l}) if z_ijl = 1 and field l is string-valued,
                                  G_l if z_ijl = 1 and field l is categorical;
Y_{j'l} ~ G_l,
z_ijl | β_il ~ Bernoulli(β_il),
β_il ~ Beta(a, b),
λ_ij ~ DiscreteUniform(1, ..., N_max), where N_max = Σ_{i=1}^{k} n_i.

The model I just showed you is very complicated. This course will give you an intro to Bayesian models and methods. Often Bayesian models are hard to work with, so we'll learn about approximations. The record linkage problem above is one that needs such an approximation.

Bayesian inference traces its origins to the 18th century and the English Reverend Thomas Bayes, who, along with Pierre-Simon Laplace, discovered what we now call Bayes' Theorem:

p(θ | x) = p(x | θ) p(θ) / p(x) ∝ p(x | θ) p(θ).   (1)

We can decompose Bayes' Theorem into three principal terms:

p(θ | x)   posterior
p(x | θ)   likelihood
p(θ)       prior

Polling Example 2012

Let's apply this to a real example! We're interested in the proportion θ of people in PA who approve of President Obama.
We take a random sample of 10 people in PA and find that 6 approve of President Obama.
The national approval rating (Zogby poll) of President Obama in mid-December was 45%. We'll assume that in PA his approval rating is approximately 50%.
Based on this prior information, we'll use a Beta prior for θ, and we'll choose a and b. (Won't get into how here.)
We can plot the prior and likelihood distributions in R and then see how the two mix to form the posterior distribution.

[Figure: prior density for θ.]

[Figure: prior and likelihood for θ.]

[Figure: prior, likelihood, and posterior for θ.]
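To reproduce plots like these, here is a minimal R sketch. The Beta(5, 5) prior (a = b = 5) is purely an illustrative assumption, since the slides deliberately do not say how a and b were chosen; the posterior line uses the Beta-Binomial conjugacy result derived later in this lecture.

# Polling example: 6 approvals out of 10, with an assumed Beta(5, 5) prior
a <- 5; b <- 5          # hypothetical prior hyperparameters, centered at 0.5
n <- 10; x <- 6         # observed data

theta <- seq(0, 1, length.out = 500)
prior      <- dbeta(theta, a, b)
likelihood <- dbinom(x, n, theta)
likelihood <- likelihood / max(likelihood) * max(prior)   # rescale for plotting
posterior  <- dbeta(theta, a + x, b + n - x)              # Beta-Binomial conjugacy

plot(theta, posterior, type = "l", col = "blue",
     xlab = expression(theta), ylab = "Density")
lines(theta, prior, col = "red")
lines(theta, likelihood, col = "darkgreen", lty = 2)
legend("topleft", legend = c("Prior", "Likelihood (rescaled)", "Posterior"),
       col = c("red", "darkgreen", "blue"), lty = c(1, 2, 1))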

The basic philosophical difference between the frequentist and Bayesian paradigms is that Bayesians treat an unknown parameter θ as random, whereas frequentists treat θ as unknown but fixed.

Stopping Rule

Let θ be the probability of a particular coin landing on heads, and suppose we want to test the hypotheses

H_0: θ = 1/2   versus   H_1: θ > 1/2

at a significance level of α = 0.05. Suppose we observe the following sequence of flips: heads, heads, heads, heads, heads, tails (5 heads, 1 tails).

To perform a frequentist hypothesis test, we must define a random variable to describe the data. The proper way to do this depends on exactly which of the following two experiments was actually performed:

Suppose the experiment is "Flip six times and record the results."

X counts the number of heads, and X ~ Binomial(6, θ). The observed data was x = 5, and the p-value of our hypothesis test is

p-value = P_{θ=1/2}(X ≥ 5)
        = P_{θ=1/2}(X = 5) + P_{θ=1/2}(X = 6)
        = 6/64 + 1/64
        = 7/64 = 0.109375 > 0.05.

So we fail to reject H_0 at α = 0.05.
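As a quick sanity check, the same p-value can be computed in R:

# p-value for the binomial experiment: P(X >= 5) with X ~ Binomial(6, 1/2)
pbinom(4, size = 6, prob = 0.5, lower.tail = FALSE)   # 0.109375
sum(dbinom(5:6, size = 6, prob = 0.5))                # same value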

Suppose now the experiment is "Flip until we get tails."

X counts the number of the flip on which the first tails occurs, and X ~ Geometric(1 − θ). The observed data was x = 6, and the p-value of our hypothesis test is

p-value = P_{θ=1/2}(X ≥ 6)
        = 1 − P_{θ=1/2}(X < 6)
        = 1 − Σ_{x=1}^{5} P_{θ=1/2}(X = x)
        = 1 − (1/2 + 1/4 + 1/8 + 1/16 + 1/32)
        = 1/32 = 0.03125 < 0.05.

So we reject H_0 at α = 0.05.
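Again, a quick R check. Note that R's dgeom counts the number of heads ("failures") before the first tails ("success"), so X − 1 follows R's geometric parameterization:

# p-value for the geometric experiment: P(X >= 6) with success prob 1 - theta = 1/2
1 - sum(dgeom(0:4, prob = 0.5))   # 0.03125
0.5^5                             # same value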

The conclusions differ, which strikes some people as absurd. The p-values aren't even close: one is 3.5 times as large as the other. The result of our hypothesis test depends on whether we would have stopped flipping if we had gotten a tails sooner. The tests are dependent on what we call the stopping rule.

The likelihood for the actual value of x that was observed is the same for both experiments (up to a constant):

p(x | θ) ∝ θ^5 (1 − θ).

A Bayesian approach would take the data into account only through this likelihood, and would therefore provide the same answer regardless of which experiment was being performed. The Bayesian analysis is independent of the stopping rule, since it depends on the data only through the likelihood (show this at home!).
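A small R sketch of this point, under an illustrative uniform Beta(1, 1) prior (my choice, not the slides'): the normalized posterior is identical whether we use the Binomial or the Geometric likelihood, because the two likelihoods differ only by a constant that does not involve θ.

theta <- seq(0.001, 0.999, length.out = 999)
prior <- dbeta(theta, 1, 1)                      # illustrative uniform prior

lik_binom <- dbinom(5, size = 6, prob = theta)   # "flip six times", 5 heads observed
lik_geom  <- dgeom(5, prob = 1 - theta)          # "flip until tails", 5 heads before first tails

post_binom <- lik_binom * prior; post_binom <- post_binom / sum(post_binom)
post_geom  <- lik_geom  * prior; post_geom  <- post_geom  / sum(post_geom)

max(abs(post_binom - post_geom))   # essentially zero: identical posteriors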

Hierarchical Bayesian Models

In a hierarchical Bayesian model, rather than specifying the prior distribution as a single function, we specify it as a hierarchy.

X | θ ~ f(x | θ),
Θ | γ ~ π(θ | γ),
Γ ~ φ(γ),

where we assume that φ(γ) is known and not dependent on any other unknown hyperparameters.
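As a concrete, purely illustrative instance of this hierarchy (none of these specific distributions come from the slide), one could take φ to be Exponential(1), π a Beta prior, and f a Binomial sampling model. Simulating one draw from such a hierarchy in R:

set.seed(1)
g     <- rexp(1, rate = 1)                  # gamma ~ phi (assumed known hyperprior)
theta <- rbeta(1, shape1 = g, shape2 = g)   # theta | gamma ~ pi(theta | gamma)
x     <- rbinom(1, size = 10, prob = theta) # X | theta ~ f(x | theta)
c(gamma = g, theta = theta, x = x)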

Conjugate Distributions

Let F be the class of sampling distributions p(y | θ), and let P denote the class of prior distributions on θ. Then P is said to be conjugate to F if for every p(θ) ∈ P and p(y | θ) ∈ F, the posterior p(θ | y) ∈ P.
Simple definition: a family of priors is conjugate if, upon being multiplied by the likelihood, it yields a posterior in the same family.

Beta-Binomial

If X | θ is distributed as Binomial(n, θ), then a conjugate prior is the Beta family of distributions, and we can show that the posterior is

π(θ | x) ∝ p(x | θ) p(θ)
         = C(n, x) θ^x (1 − θ)^{n−x} · [Γ(a + b) / (Γ(a) Γ(b))] θ^{a−1} (1 − θ)^{b−1}
         ∝ θ^x (1 − θ)^{n−x} θ^{a−1} (1 − θ)^{b−1}
         = θ^{x+a−1} (1 − θ)^{n−x+b−1},

so θ | x ~ Beta(x + a, n − x + b).
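A quick numerical check of this conjugacy result in R, with illustrative values of a, b, n, and x (not taken from the slides): the grid-normalized product of likelihood and prior matches the closed-form Beta(x + a, n − x + b) density.

a <- 2; b <- 3; n <- 10; x <- 6                       # illustrative values
theta <- seq(0.001, 0.999, length.out = 999)
h <- theta[2] - theta[1]                              # grid spacing

post_grid <- dbinom(x, n, theta) * dbeta(theta, a, b) # likelihood * prior
post_grid <- post_grid / (sum(post_grid) * h)         # normalize on the grid

post_exact <- dbeta(theta, x + a, n - x + b)          # conjugate posterior

max(abs(post_grid - post_exact))                      # small numerical error only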