Lecture 2: Conjugate priors


CS98JHM: Advanced NLP (Spring). Lecture 2: Conjugate priors. Julia Hockenmaier, juliahmr@illinois.edu, Siebel Center, http://www.cs.uiuc.edu/class/sp/cs98jhm

The binomial distribution
If p is the probability of heads, the probability of getting exactly k heads in n independent yes/no trials is given by the binomial distribution Bin(n,p):
$P(k \text{ heads}) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$
Expectation: $E[\mathrm{Bin}(n,p)] = np$
Variance: $\mathrm{var}[\mathrm{Bin}(n,p)] = np(1-p)$

Parameter estimation
Given a set of data D = HTTHTT, what is the probability of heads?

Binomial likelihood
What distribution does p (the probability of heads) have, given that the data D consists of #H heads and #T tails?
[Figure: the likelihood $L(\theta; D = (\#\mathrm{Heads}, \#\mathrm{Tails}))$ of the binomial distribution, plotted as a function of $\theta$ for several head/tail counts.]
- Maximum likelihood estimation (MLE): use the $\theta$ which has the highest likelihood $P(D \mid \theta)$:
$P(x{=}H \mid D) \approx P(x{=}H \mid \hat\theta) \quad\text{with}\quad \hat\theta = \arg\max_\theta P(D \mid \theta)$
- Bayesian estimation: compute the expectation of $\theta$ given D:
$P(x{=}H \mid D) = \int P(x{=}H \mid \theta)\, P(\theta \mid D)\, d\theta = E[\theta \mid D]$
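As a concrete illustration of the likelihood picture above, here is a minimal sketch (not from the original slides): it evaluates the binomial likelihood on a grid of θ values and picks out the maximizer numerically, and it sanity-checks the quoted mean and variance of Bin(n,p). The counts H = 2, T = 4 are illustrative, matching the example string HTTHTT.

```python
# Illustrative sketch: binomial likelihood L(theta; H, T) on a grid of theta values.
import numpy as np
from scipy.stats import binom

H, T = 2, 4              # e.g. D = HTTHTT -> 2 heads, 4 tails (illustrative counts)
n = H + T

thetas = np.linspace(0.001, 0.999, 999)
likelihood = binom.pmf(H, n, thetas)        # C(n,H) theta^H (1-theta)^T for each theta

theta_mle = thetas[np.argmax(likelihood)]
print(theta_mle)                            # ~ H / (H + T) = 0.333

# Sanity-check the moments of Bin(n, p) quoted above.
p = 0.3
assert np.isclose(binom.mean(n, p), n * p)
assert np.isclose(binom.var(n, p), n * p * (1 - p))
```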

Maximum likelihood estimation
- Maximum likelihood estimation (MLE): find the $\theta$ which maximizes the likelihood $P(D \mid \theta)$:
$\hat\theta = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \theta^H (1-\theta)^T = \frac{H}{H+T}$

Bayesian statistics
- Data D provides evidence for or against our beliefs. We update our belief based on the evidence we see:
$P(\theta \mid D) = \frac{P(\theta)\, P(D \mid \theta)}{\int P(\theta)\, P(D \mid \theta)\, d\theta}$
(posterior = prior × likelihood / marginal likelihood P(D))

Bayesian estimation
Given a prior P(θ) and a likelihood P(D | θ), what is the posterior P(θ | D)? How do we choose the prior P(θ)?
- The posterior is proportional to prior × likelihood: $P(\theta \mid D) \propto P(\theta)\, P(D \mid \theta)$
- The likelihood of a binomial is: $P(D \mid \theta) \propto \theta^H (1-\theta)^T$
- If the prior P(θ) is proportional to powers of θ and (1-θ), the posterior will also be proportional to powers of θ and (1-θ):
$P(\theta) \propto \theta^a (1-\theta)^b \;\Rightarrow\; P(\theta \mid D) \propto \theta^a (1-\theta)^b\, \theta^H (1-\theta)^T = \theta^{a+H} (1-\theta)^{b+T}$

In search of a prior...
We would like something of the form $P(\theta) \propto \theta^a (1-\theta)^b$. But this looks just like the binomial,
$P(k \text{ heads}) = \binom{n}{k} p^k (1-p)^{n-k}$,
except that k is an integer, while θ is a real number with 0 < θ < 1.
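The proportionality argument in the Bayesian estimation slide above is easy to check numerically. The sketch below (my addition; the prior exponents a, b and the counts H, T are illustrative) forms prior × likelihood on a grid, normalizes it, and confirms it matches the normalized $\theta^{a+H}(1-\theta)^{b+T}$.

```python
# Illustrative sketch: posterior on a grid as prior x likelihood, compared with
# the curve theta^(a+H) (1-theta)^(b+T) predicted by the slide.
import numpy as np

a, b = 2.0, 1.0      # assumed prior exponents: P(theta) proportional to theta^a (1-theta)^b
H, T = 2, 4          # observed heads and tails

thetas = np.linspace(0.0, 1.0, 10001)
prior      = thetas**a * (1 - thetas)**b
likelihood = thetas**H * (1 - thetas)**T

post = prior * likelihood
post /= post.sum()                                  # normalize over the grid

target = thetas**(a + H) * (1 - thetas)**(b + T)
target /= target.sum()

print(np.max(np.abs(post - target)))                # ~0: same curve
```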

The Gamma function
The Gamma function Γ(x) is the generalization of the factorial x! (or rather (x-1)!) to the reals:
$\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\, dx \quad \text{for } \alpha > 0$
For x > 1, Γ(x) = (x-1) Γ(x-1). For positive integers, Γ(x) = (x-1)!
[Figure: plot of the Gamma function Γ(x).]

The Beta distribution
A random variable X (0 < x < 1) has a Beta distribution with (hyper)parameters α (α > 0) and β (β > 0) if X has a continuous distribution with probability density function
$P(x \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}$
The first term is a normalization factor (to obtain a distribution):
$\int_0^1 x^{\alpha-1} (1-x)^{\beta-1}\, dx = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}$
Expectation: $E[X] = \frac{\alpha}{\alpha+\beta}$

Beta(α,β) with α > 1, β > 1
Unimodal.
[Figure: unimodal Beta densities for several values of α, β > 1.]
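A small sketch (my addition, with illustrative parameter values, assuming SciPy is available) verifying the two facts above: Γ(x) = (x-1)! for positive integers, and the Beta density defined here integrates to 1 with mean α/(α+β).

```python
# Illustrative sketch: check the Gamma-factorial identity and the Beta normalization/mean.
import math
from scipy import integrate

def beta_pdf(x, alpha, beta):
    # Density from the slide: Gamma(a+b)/(Gamma(a)Gamma(b)) x^(a-1) (1-x)^(b-1)
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return norm * x**(alpha - 1) * (1 - x)**(beta - 1)

for x in range(1, 6):
    assert math.isclose(math.gamma(x), math.factorial(x - 1))   # Gamma(x) = (x-1)!

alpha, beta = 3.0, 2.0                                           # illustrative hyperparameters
total, _ = integrate.quad(beta_pdf, 0, 1, args=(alpha, beta))
mean,  _ = integrate.quad(lambda x: x * beta_pdf(x, alpha, beta), 0, 1)
print(total, mean, alpha / (alpha + beta))                       # 1.0, 0.6, 0.6
```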

Beta(",#) with " <, # < Beta(",#) with "# U-shaped Beta(.,.) Beta(.,.) Beta(.,.) Symmetric. #$: uniform.... Beta(.,.) Beta(,) Beta(,)....8....8 Beta(",#) with "<, # > Beta(",#) with ", # > Strictly decreasing 8 7 Beta(.,.) Beta(.,.) Beta(.,) #, < $ < : strictly concave. #, $ : straight line #, $ > : strictly convex... Beta(,.) Beta(,) Beta(,)....8.....8

Beta as prior for the binomial
Given a prior P(θ | α, β) = Beta(α, β) and data D = (H, T), what is our posterior?
$P(\theta \mid \alpha, \beta, H, T) \propto P(H, T \mid \theta)\, P(\theta \mid \alpha, \beta) \propto \theta^H (1-\theta)^T\, \theta^{\alpha-1} (1-\theta)^{\beta-1} = \theta^{H+\alpha-1} (1-\theta)^{T+\beta-1}$
With normalization:
$P(\theta \mid \alpha, \beta, H, T) = \frac{\Gamma(H+\alpha+T+\beta)}{\Gamma(H+\alpha)\,\Gamma(T+\beta)}\, \theta^{H+\alpha-1} (1-\theta)^{T+\beta-1} = \mathrm{Beta}(\alpha+H,\, \beta+T)$

So, what do we predict?
Our Bayesian estimate for the next coin flip, P(x = H | D):
$P(x{=}H \mid D) = \int P(x{=}H \mid \theta)\, P(\theta \mid D)\, d\theta = \int \theta\, P(\theta \mid D)\, d\theta = E[\theta \mid D] = E[\mathrm{Beta}(H+\alpha, T+\beta)] = \frac{H+\alpha}{H+\alpha+T+\beta}$

Conjugate priors
The Beta distribution is a conjugate prior to the binomial: the resulting posterior is also a Beta distribution. All members of the exponential family of distributions have conjugate priors. Examples:
- Multinomial: conjugate prior is the Dirichlet
- Gaussian: conjugate prior is the Gaussian

Multinomials: Dirichlet prior
Multinomial distribution: the probability of observing each possible outcome c_i exactly x_i times in a sequence of n independent trials:
$P(X_1 = x_1, \ldots, X_K = x_K) = \frac{n!}{x_1! \cdots x_K!}\, \theta_1^{x_1} \cdots \theta_K^{x_K} \quad \text{if } \sum_{i=1}^K x_i = n$
Dirichlet prior:
$\mathrm{Dir}(\theta \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_K)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^K \theta_k^{\alpha_k - 1}$
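The conjugate update and the predictive probability above take only a few lines of code. This sketch (my addition; the hyperparameters and counts are illustrative) performs the Beta-binomial update and the analogous Dirichlet-multinomial update, and confirms that the posterior mean equals the predictive formula (H+α)/(H+α+T+β).

```python
# Illustrative sketch: conjugate updates for Beta-binomial and Dirichlet-multinomial.
import numpy as np
from scipy.stats import beta

a, b = 2.0, 2.0                        # Beta prior hyperparameters (illustrative)
H, T = 2, 4                            # observed heads and tails

posterior = beta(a + H, b + T)         # conjugacy: the posterior is again a Beta
p_heads_next = (H + a) / (H + a + T + b)
print(posterior.mean(), p_heads_next)  # both equal (H+a)/(H+a+T+b) = 0.4

# Dirichlet-multinomial: add the observed counts to the Dirichlet hyperparameters.
alphas = np.array([1.0, 1.0, 1.0])     # illustrative prior over a 3-outcome multinomial
counts = np.array([5, 2, 3])
posterior_alphas = alphas + counts     # posterior is Dir(alphas + counts)
```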

More about conjugate priors
- We can interpret the hyperparameters as pseudocounts.
- Sequential estimation (updating the counts after each observation) gives the same result as batch estimation.
- Add-one smoothing (Laplace smoothing) corresponds to a uniform prior.
- On average, more data leads to a sharper posterior (sharper = lower variance).

Today's reading
- Bishop, Pattern Recognition and Machine Learning, Ch. 2
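As a final worked example (my addition; the data string and the Beta(1,1) prior are illustrative), the sketch below checks the first three bullets of the last slide: sequential and batch updates of the pseudocounts agree, and a uniform prior reproduces the add-one (Laplace) estimate.

```python
# Illustrative sketch: sequential vs. batch pseudocount updates, and add-one smoothing.
a, b = 1.0, 1.0                        # Beta(1,1) = uniform prior = Laplace smoothing
data = "HTTHTT"                        # illustrative observation sequence

# Sequential: bump the matching pseudocount after each observation.
a_seq, b_seq = a, b
for flip in data:
    if flip == "H":
        a_seq += 1
    else:
        b_seq += 1

# Batch: add all counts at once.
H, T = data.count("H"), data.count("T")
a_batch, b_batch = a + H, b + T

assert (a_seq, b_seq) == (a_batch, b_batch)      # same posterior either way
print(a_batch / (a_batch + b_batch))             # (H+1)/(H+T+2) = 3/8, the add-one estimate
```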