Conjugate Priors, Uninformative Priors

Nasim Zolaktaf, UBC Machine Learning Reading Group, January 2016

Outline
- Exponential Families
- Conjugacy
- Conjugate Priors
- Mixtures of Conjugate Priors
- Uninformative Priors
- Jeffreys Prior

The Exponential Family

A probability mass function (pmf) or probability density function (pdf) p(x | θ), for X = (X_1, ..., X_m) ∈ X^m and θ ∈ R^d, is said to be in the exponential family if it is of the form

p(x | θ) = (1/Z(θ)) h(x) exp[θ^T φ(x)]   (1)
         = h(x) exp[θ^T φ(x) − A(θ)]   (2)

where

Z(θ) = ∫_{X^m} h(x) exp[θ^T φ(x)] dx   (3)
A(θ) = log Z(θ)   (4)

Z(θ) is called the partition function, A(θ) is called the log partition function or cumulant function, and h(x) is a scaling constant. Equation 2 can be generalized by writing

p(x | θ) = h(x) exp[η(θ)^T φ(x) − A(η(θ))]   (5)

Binomial Distribution

As an example of a discrete exponential family, consider the Binomial distribution with known number of trials n. The pmf for this distribution is

p(x | θ) = Binomial(n, θ) = (n choose x) θ^x (1 − θ)^{n−x},   x ∈ {0, 1, ..., n}   (6)

This can equivalently be written as

p(x | θ) = (n choose x) exp[x log(θ / (1 − θ)) + n log(1 − θ)]   (7)

which shows that the Binomial distribution is an exponential family, whose natural parameter is

η = log(θ / (1 − θ))   (8)
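
As a quick numerical check of Equations 6-8 (not part of the original slides; the function names and the values n = 10, θ = 0.3 are illustrative), the following Python sketch confirms that the exponential-family form reproduces the standard Binomial pmf:

    import math

    def binom_pmf(x, n, theta):
        # Standard form, Equation 6: (n choose x) theta^x (1 - theta)^(n - x)
        return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

    def binom_expfam(x, n, theta):
        # Exponential-family form, Equation 7, with natural parameter eta = log(theta / (1 - theta))
        eta = math.log(theta / (1 - theta))
        return math.comb(n, x) * math.exp(x * eta + n * math.log(1 - theta))

    n, theta = 10, 0.3
    for x in range(n + 1):
        assert abs(binom_pmf(x, n, theta) - binom_expfam(x, n, theta)) < 1e-12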

Conjugacy

Consider the posterior distribution p(θ | X) with prior p(θ) and likelihood function p(X | θ), where p(θ | X) ∝ p(X | θ) p(θ).

If the posterior distribution p(θ | X) is in the same family as the prior distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function p(X | θ).

All members of the exponential family have conjugate priors.
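
A sketch of why this holds (a standard derivation, not spelled out on the slide; the hyperparameters τ_0 and ν_0 are introduced here only for illustration): for a likelihood in the form of Equation 2 with n i.i.d. observations, take a prior with the same functional form,

p(θ | τ_0, ν_0) ∝ exp[θ^T τ_0 − ν_0 A(θ)]
p(θ | x_1, ..., x_n) ∝ p(θ) ∏_{i=1}^n p(x_i | θ) ∝ exp[θ^T (τ_0 + Σ_{i=1}^n φ(x_i)) − (ν_0 + n) A(θ)]

so the posterior stays in the same family, with hyperparameters updated to τ_0 + Σ_i φ(x_i) and ν_0 + n. The Beta-Binomial pair on the following slides is an instance of this pattern.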

Brief List of Conjugate Models

The Conjugate Beta Prior

The Beta distribution is conjugate to the Binomial distribution:

p(θ | x) ∝ p(x | θ) p(θ) = Binomial(n, θ) × Beta(a, b)
         = (n choose x) θ^x (1 − θ)^{n−x} × [Γ(a + b) / (Γ(a) Γ(b))] θ^{a−1} (1 − θ)^{b−1}
         ∝ θ^x (1 − θ)^{n−x} θ^{a−1} (1 − θ)^{b−1}   (9)

p(θ | x) ∝ θ^{x+a−1} (1 − θ)^{n−x+b−1}   (10)

The posterior distribution is simply a Beta(x + a, n − x + b) distribution. Effectively, our prior just adds a − 1 successes and b − 1 failures to the dataset.
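
A small numerical sketch of this update (illustrative values; assumes NumPy and SciPy are available): multiply the Binomial likelihood by a Beta(a, b) prior on a grid of θ values, normalize, and compare against the closed-form Beta(x + a, n − x + b) posterior.

    import numpy as np
    from scipy import stats

    a, b = 2.0, 2.0   # Beta prior hyperparameters (illustrative)
    n, x = 10, 7      # observed data: 7 successes in 10 trials (illustrative)

    theta = np.linspace(1e-6, 1 - 1e-6, 2001)
    unnorm = stats.binom.pmf(x, n, theta) * stats.beta.pdf(theta, a, b)
    grid_post = unnorm / (unnorm.sum() * (theta[1] - theta[0]))   # numerical normalization
    closed_form = stats.beta.pdf(theta, x + a, n - x + b)         # conjugate update

    print(np.max(np.abs(grid_post - closed_form)))  # small, up to grid error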

Coin Flipping Example: Model

Use a Bernoulli likelihood for coin X landing heads,
p(X = H | θ) = θ,   p(X = T | θ) = 1 − θ,
p(X | θ) = θ^{I(X=H)} (1 − θ)^{I(X=T)}.

Use a Beta prior for the probability θ of heads, θ ~ Beta(a, b),
p(θ | a, b) = θ^{a−1} (1 − θ)^{b−1} / B(a, b) ∝ θ^{a−1} (1 − θ)^{b−1},
with B(a, b) = Γ(a) Γ(b) / Γ(a + b).

Remember that densities integrate to one, so we have
1 = ∫_0^1 p(θ | a, b) dθ = ∫_0^1 [θ^{a−1} (1 − θ)^{b−1} / B(a, b)] dθ = (1 / B(a, b)) ∫_0^1 θ^{a−1} (1 − θ)^{b−1} dθ,
which helps us compute integrals since we have
∫_0^1 θ^{a−1} (1 − θ)^{b−1} dθ = B(a, b).
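
The normalization identity above can be checked numerically; a minimal sketch, assuming SciPy is available and using illustrative values a = 2.5, b = 4:

    from math import gamma
    from scipy.integrate import quad

    a, b = 2.5, 4.0  # illustrative hyperparameters
    integral, _ = quad(lambda t: t**(a - 1) * (1 - t)**(b - 1), 0.0, 1.0)
    beta_ab = gamma(a) * gamma(b) / gamma(a + b)  # B(a, b)
    print(integral, beta_ab)  # the two values agree to numerical precision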

Coin Flipping Example: Posterior

Our model is: X ~ Ber(θ), θ ~ Beta(a, b).

If we observe HHH then our posterior distribution is
p(θ | HHH) = p(HHH | θ) p(θ) / p(HHH)   (Bayes' rule)
           ∝ p(HHH | θ) p(θ)   (p(HHH) is constant)
           = θ^3 (1 − θ)^0 p(θ)   (likelihood definition)
           ∝ θ^3 (1 − θ)^0 θ^{a−1} (1 − θ)^{b−1}   (prior definition)
           = θ^{(3+a)−1} (1 − θ)^{b−1}.

We have written this in the form of a Beta distribution, θ | HHH ~ Beta(3 + a, b), which lets us skip computing the integral p(HHH).

Coin Flipping Example: Estimates

If we observe HHH with a Beta(1, 1) prior, then θ | HHH ~ Beta(3 + a, b) and our different estimates are:

Maximum likelihood:
θ̂ = n_H / n = 3/3 = 1.

MAP with uniform prior:
θ̂ = ((3 + a) − 1) / ((3 + a) + b − 2) = 3/3 = 1.

Posterior predictive:
p(H | HHH) = ∫_0^1 p(H | θ) p(θ | HHH) dθ
           = ∫_0^1 Ber(H | θ) Beta(θ | 3 + a, b) dθ
           = ∫_0^1 θ Beta(θ | 3 + a, b) dθ
           = E[θ | HHH] = (3 + a) / (3 + a + b) = 4/5 = 0.8.
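
In code, the three estimates after observing HHH with a Beta(1, 1) prior (a minimal sketch; variable names are illustrative):

    a, b = 1.0, 1.0          # Beta(1, 1) = uniform prior
    n_heads, n_tails = 3, 0  # observed HHH

    post_a, post_b = n_heads + a, n_tails + b  # posterior is Beta(post_a, post_b)

    mle = n_heads / (n_heads + n_tails)                  # 1.0
    map_estimate = (post_a - 1) / (post_a + post_b - 2)  # posterior mode: 1.0
    post_predictive = post_a / (post_a + post_b)         # posterior mean: 0.8

    print(mle, map_estimate, post_predictive)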

Coin Flipping Example: Effect of Prior

We assumed all θ equally likely and saw HHH, so ML/MAP predict the coin will always land heads. Bayes predicts the probability of landing heads is only 80%: it takes into account other ways that HHH could happen.

Beta(1, 1) prior is like seeing HT or TH in our mind before we flip: the posterior predictive would be (3 + 1)/(3 + 1 + 1) = 0.80.
Beta(3, 3) prior is like seeing 3 heads and 3 tails (a stronger uniform prior): the posterior predictive would be (3 + 3)/(3 + 3 + 3) = 0.667.
Beta(100, 1) prior is like seeing 100 heads and 1 tail (biased): the posterior predictive would be (3 + 100)/(3 + 100 + 1) = 0.990.
Beta(0.01, 0.01) biases towards an unfair coin (heads or tails): the posterior predictive would be (3 + 0.01)/(3 + 0.01 + 0.01) = 0.997.

The dependence on (a, b) is where people get uncomfortable, but it is basically the same as choosing a regularization parameter λ: if your prior knowledge isn't misleading, you will not overfit.
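
These posterior predictive values all follow from the same formula, (3 + a)/(3 + a + b); a short sketch reproducing the numbers above:

    def posterior_predictive_heads(a, b, n_heads=3, n_tails=0):
        # P(next flip = H | data) = posterior mean of theta under a Beta(a, b) prior
        return (n_heads + a) / (n_heads + n_tails + a + b)

    for a, b in [(1, 1), (3, 3), (100, 1), (0.01, 0.01)]:
        print(f"Beta({a}, {b}): {posterior_predictive_heads(a, b):.3f}")
    # prints 0.800, 0.667, 0.990, 0.997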

Mixtures of Conjugate Priors

A mixture of conjugate priors is also conjugate. We can use a mixture of conjugate priors as a prior.

For example, suppose we are modelling coin tosses, and we think the coin is either fair, or is biased towards heads. This cannot be represented by a Beta distribution. However, we can model it using a mixture of two Beta distributions. For example, we might use

p(θ) = 0.5 Beta(θ | 20, 20) + 0.5 Beta(θ | 30, 10)   (11)

If θ comes from the first distribution the coin is fair, but if it comes from the second it is biased towards heads.

Mixtures of Conjugate Priors (Cont.)

The prior has the form

p(θ) = Σ_k p(z = k) p(θ | z = k)   (12)

where z = k means that θ comes from mixture component k, the p(z = k) are called the prior mixing weights, and each p(θ | z = k) is conjugate.

The posterior can also be written as a mixture of conjugate distributions:

p(θ | X) = Σ_k p(z = k | X) p(θ | X, z = k)   (13)

where p(z = k | X) are the posterior mixing weights, given by

p(z = k | X) = p(z = k) p(X | z = k) / Σ_{k'} p(z = k') p(X | z = k')   (14)
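
A sketch of how the posterior mixing weights can be computed for the two-component Beta prior of Equation 11 (the data x = 20 heads in n = 25 tosses is illustrative; p(X | z = k) is the Beta-Binomial evidence, in which the shared binomial coefficient cancels):

    from math import lgamma, exp

    def log_beta_fn(a, b):
        # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
        return lgamma(a) + lgamma(b) - lgamma(a + b)

    def log_evidence(x, n, a, b):
        # log p(x | z = k) for a Beta(a, b) component, dropping the shared (n choose x) factor
        return log_beta_fn(x + a, n - x + b) - log_beta_fn(a, b)

    weights, params = [0.5, 0.5], [(20.0, 20.0), (30.0, 10.0)]  # prior of Equation 11
    x, n = 20, 25  # illustrative data

    unnorm = [w * exp(log_evidence(x, n, a, b)) for w, (a, b) in zip(weights, params)]
    post_weights = [u / sum(unnorm) for u in unnorm]  # p(z = k | X), Equation 14
    print(post_weights)
    # The posterior is sum_k post_weights[k] * Beta(theta | x + a_k, n - x + b_k).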

Uninformative Priors If we don t have strong beliefs about what θ should be, it is common to use an uninformative or non-informative prior, and to let the data speak for itself. Designing uninformative priors is tricky.

Uninformative Prior for the Bernoulli

Consider the Bernoulli parameter θ, with likelihood

p(x | θ) = θ^x (1 − θ)^{n−x}   (15)

What is the most uninformative prior for this distribution?

The uniform distribution Uniform(0, 1)? This is equivalent to a Beta(1, 1) prior on θ.

The MLE is N_1 / (N_1 + N_0), whereas the posterior mean is E[θ | X] = (N_1 + 1) / (N_1 + N_0 + 2). The prior isn't completely uninformative!

By decreasing the magnitude of the pseudo counts, we can lessen the impact of the prior. By this argument, the most uninformative prior is Beta(0, 0), which is called the Haldane prior.

The Haldane prior is an improper prior; it does not integrate to 1.

The Haldane prior results in the posterior Beta(x, n − x), which will be proper as long as x ≠ 0 and n − x ≠ 0.

We will see that the "right" uninformative prior is Beta(1/2, 1/2).
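
A quick comparison of these choices on illustrative counts (7 heads, 3 tails), showing the posterior mean under each prior and how the Haldane prior recovers the MLE:

    n1, n0 = 7, 3  # illustrative counts: 7 heads, 3 tails

    priors = [("uniform Beta(1, 1)", 1.0, 1.0),
              ("Jeffreys Beta(1/2, 1/2)", 0.5, 0.5),
              ("Haldane Beta(0, 0)", 0.0, 0.0)]
    for name, a, b in priors:
        post_mean = (n1 + a) / (n1 + n0 + a + b)  # E[theta | X] under a Beta(a, b) prior
        print(f"{name}: posterior mean = {post_mean:.4f}")
    print(f"MLE: {n1 / (n1 + n0):.4f}")  # matches the Haldane posterior mean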

Jeffreys Prior

Jeffreys argued that an uninformative prior should be invariant to the parametrization used. The key observation is that if p(θ) is uninformative, then any reparametrization of the prior, such as θ = h(φ) for some function h, should also be uninformative.

Jeffreys prior is the prior that satisfies p(θ) ∝ √(det I(θ)), where I(θ) is the Fisher information for θ; it is invariant under reparametrization of the parameter vector θ.

The Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ on which the probability of X depends:

I(θ) = E[(∂/∂θ log f(X; θ))^2 | θ]   (16)

If log f(x; θ) is twice differentiable with respect to θ, then

I(θ) = −E[∂^2/∂θ^2 log f(X; θ) | θ]   (17)

Reparametrization for Jeffreys Prior: One-Parameter Case

For an alternative parametrization φ we can derive p(φ) ∝ √I(φ) from p(θ) ∝ √I(θ), using the change-of-variables theorem and the definition of Fisher information:

p(φ) = p(θ) |dθ/dφ|
     ∝ √(I(θ) (dθ/dφ)^2)
     = √(E[(d ln L / dθ)^2] (dθ/dφ)^2)
     = √(E[(d ln L / dθ · dθ/dφ)^2])
     = √(E[(d ln L / dφ)^2])
     = √I(φ)   (18)
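
As a concrete one-parameter illustration of this invariance (added here for reference; it anticipates the Bernoulli result on the next slide), take θ = sin^2(φ), so dθ/dφ = 2 sin φ cos φ and I(θ) = 1/(θ(1 − θ)):

I(φ) = I(θ) (dθ/dφ)^2 = (2 sin φ cos φ)^2 / (sin^2 φ cos^2 φ) = 4

so the Jeffreys prior on φ is flat. Equivalently, pushing p(θ) ∝ θ^{−1/2} (1 − θ)^{−1/2} through the change of variables gives p(φ) ∝ (sin φ cos φ)^{−1} · |2 sin φ cos φ| = 2, the same flat prior.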

Jeffreys Prior for the Bernoulli

Suppose X ~ Ber(θ). The log-likelihood for a single sample is

log p(X | θ) = X log θ + (1 − X) log(1 − θ)   (19)

The score function is the gradient of the log-likelihood,

s(θ) = d/dθ log p(X | θ) = X/θ − (1 − X)/(1 − θ)   (20)

The observed information is

J(θ) = −d^2/dθ^2 log p(X | θ) = −s′(θ | X) = X/θ^2 + (1 − X)/(1 − θ)^2   (21)

The Fisher information is the expected information,

I(θ) = E[J(θ | X) | θ] = θ/θ^2 + (1 − θ)/(1 − θ)^2 = 1/(θ(1 − θ))   (22)

Hence Jeffreys prior is

p(θ) ∝ θ^{−1/2} (1 − θ)^{−1/2} ∝ Beta(1/2, 1/2).   (23)
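
The Fisher information in Equation 22 can also be checked by Monte Carlo, using the squared-score definition in Equation 16 (a minimal sketch; the sample size and θ values are illustrative):

    import random

    def fisher_info_mc(theta, n_samples=200_000, seed=0):
        # Monte Carlo estimate of E[(d/dtheta log p(X | theta))^2] for X ~ Ber(theta)
        rng = random.Random(seed)
        total = 0.0
        for _ in range(n_samples):
            x = 1 if rng.random() < theta else 0
            score = x / theta - (1 - x) / (1 - theta)  # Equation 20
            total += score ** 2
        return total / n_samples

    for theta in (0.2, 0.5, 0.8):
        print(theta, fisher_info_mc(theta), 1 / (theta * (1 - theta)))  # close to Equation 22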

Selected Related Work

[1] Kevin P. Murphy (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
[2] Jarad Niemi. Conjugacy of prior distributions: https://www.youtube.com/watch?v=yhewyfqgjfa
[3] Jarad Niemi. Noninformative prior distributions: https://www.youtube.com/watch?v=25-ppmsragm
[4] Fisher information: https://en.wikipedia.org/wiki/Fisher_information