(Spring ʼ) Lecture: Conjugate priors
Julia Hockenmaier
juliahmr@illinois.edu
Siebel Center
http://www.cs.uiuc.edu/class/sp/cs98jhm
CS98JHM: Advanced NLP

The binomial distribution

If p is the probability of heads, the probability of getting exactly k heads in n independent yes/no trials is given by the binomial distribution Bin(n,p):

    P(k heads) = (n choose k) p^k (1−p)^(n−k) = n!/(k!(n−k)!) · p^k (1−p)^(n−k)

Expectation: E[Bin(n,p)] = np
Variance: var(Bin(n,p)) = np(1−p)

Binomial likelihood

What distribution does p (the probability of heads) have, given that the data D consists of #H heads and #T tails?

Parameter estimation

Given a set of data D = HTTHTT, what is the probability of heads?

[Figure: the likelihood L(θ; D = (#Heads, #Tails)) of the binomial distribution, plotted for several (#Heads, #Tails) counts]

- Maximum likelihood estimation (MLE): use the θ which has the highest likelihood P(D|θ):

    P(x=H | D) = P(x=H | θ̂)   with θ̂ = argmax_θ P(D|θ)

- Bayesian estimation: compute the expectation of θ given D:

    P(x=H | D) = ∫ P(x=H | θ) P(θ|D) dθ = E[θ|D]
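To make this contrast concrete, here is a minimal Python sketch (my own illustration, not from the slides; the helper name binom_pmf is hypothetical) that evaluates the Bin(n,p) probabilities and recovers the MLE for the HTTHTT example:

    from math import comb

    def binom_pmf(k, n, p):
        # P(k heads) = (n choose k) p^k (1-p)^(n-k)
        return comb(n, k) * p**k * (1 - p)**(n - k)

    n, p = 10, 0.3
    print(n * p, n * p * (1 - p))                         # E = np = 3.0, var = np(1-p) = 2.1
    print(sum(binom_pmf(k, n, p) for k in range(n + 1)))  # ~1.0: the pmf sums to one

    # MLE for the coin example D = HTTHTT (2 heads, 4 tails):
    H, T = 2, 4
    theta_hat = H / (H + T)   # argmax of theta^H (1-theta)^T, derived below
    print(theta_hat)          # 0.333...

The Bayesian estimate E[θ|D] additionally needs a prior distribution over θ, which is what the rest of the lecture constructs.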
Maximum likelihood estimation

- Maximum likelihood estimation (MLE): find the θ which maximizes the likelihood P(D|θ):

    θ̂ = argmax_θ P(D|θ) = argmax_θ θ^H (1−θ)^T = H/(H+T)

Bayesian statistics

- Data D provides evidence for or against our beliefs. We update our belief about θ based on the evidence we see:

    P(θ|D) = P(θ) P(D|θ) / ∫ P(θ) P(D|θ) dθ

    posterior = prior × likelihood / marginal likelihood P(D)

Bayesian estimation

Given a prior P(θ) and a likelihood P(D|θ), what is the posterior P(θ|D)? How do we choose the prior P(θ)?

- The posterior is proportional to prior × likelihood:

    P(θ|D) ∝ P(θ) P(D|θ)

- The likelihood of a binomial is:

    P(D|θ) = θ^H (1−θ)^T

- If the prior P(θ) is proportional to powers of θ and (1−θ), the posterior will also be proportional to powers of θ and (1−θ):

    P(θ) ∝ θ^a (1−θ)^b  ⟹  P(θ|D) ∝ θ^a (1−θ)^b · θ^H (1−θ)^T = θ^(a+H) (1−θ)^(b+T)

In search of a prior...

We would like something of the form

    P(θ) ∝ θ^a (1−θ)^b

But this looks just like the binomial:

    P(k heads) = (n choose k) p^k (1−p)^(n−k) = n!/(k!(n−k)!) · p^k (1−p)^(n−k)

...except that k is an integer, while θ is a real number with 0 < θ < 1.
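Before giving this prior a name, the proportionality argument can be checked numerically: with any prior of the form θ^a (1−θ)^b, renormalizing prior × likelihood on a grid reproduces θ^(a+H) (1−θ)^(b+T). A sketch of my own (the exponents a, b are arbitrary; np.trapz stands in for the normalizing integral P(D)):

    import numpy as np

    H, T = 2, 4        # observed heads and tails
    a, b = 3.0, 3.0    # prior exponents: P(theta) proportional to theta^a (1-theta)^b

    theta = np.linspace(1e-6, 1 - 1e-6, 10001)
    prior = theta**a * (1 - theta)**b
    likelihood = theta**H * (1 - theta)**T

    # Renormalize prior * likelihood on the grid (np.trapz stands in for P(D)).
    post = prior * likelihood
    post /= np.trapz(post, theta)

    # The posterior should match theta^(a+H) (1-theta)^(b+T), renormalized.
    ref = theta**(a + H) * (1 - theta)**(b + T)
    ref /= np.trapz(ref, theta)
    print(np.abs(post - ref).max())   # ~0: identical densities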
The Gamma function

The Gamma function Γ(x) is the generalization of the factorial x! (or rather (x−1)!) to the reals:

    Γ(α) = ∫_0^∞ x^(α−1) e^(−x) dx   for α > 0

For x > 1: Γ(x) = (x−1) Γ(x−1).
For positive integers: Γ(x) = (x−1)!

[Figure: the Gamma function Γ(x)]

The Beta distribution

A random variable X (0 < x < 1) has a Beta distribution with (hyper)parameters α (α > 0) and β (β > 0) if X has a continuous distribution with probability density function

    P(x | α, β) = Γ(α+β) / (Γ(α)Γ(β)) · x^(α−1) (1−x)^(β−1)

The first term is a normalization factor (to obtain a distribution):

    ∫_0^1 x^(α−1) (1−x)^(β−1) dx = Γ(α)Γ(β) / Γ(α+β)

Expectation:

    E[X] = α / (α+β)

Beta(α,β) with α > 1, β > 1: unimodal
[Figure: unimodal Beta densities for several (α,β) with α, β > 1]
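A quick numerical check of these identities, using Python's math.gamma (the particular values α=2.5, β=4 are my own picks):

    import math
    import numpy as np

    # Gamma generalizes the factorial: Gamma(n) = (n-1)! for positive integers.
    print(math.gamma(5), math.factorial(4))          # 24.0  24

    # Recursion Gamma(x) = (x-1) Gamma(x-1) for x > 1:
    x = 3.7
    print(math.gamma(x), (x - 1) * math.gamma(x - 1))

    # Beta normalization: integral of x^(a-1) (1-x)^(b-1) = Gamma(a)Gamma(b)/Gamma(a+b)
    a, b = 2.5, 4.0
    xs = np.linspace(1e-6, 1 - 1e-6, 100001)
    integral = np.trapz(xs**(a - 1) * (1 - xs)**(b - 1), xs)
    print(integral, math.gamma(a) * math.gamma(b) / math.gamma(a + b))

    # Expectation of Beta(a,b) is a/(a+b):
    pdf = xs**(a - 1) * (1 - xs)**(b - 1) / integral
    print(np.trapz(xs * pdf, xs), a / (a + b))       # both ~0.3846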
Beta(",#) with " <, # < Beta(",#) with "# U-shaped Beta(.,.) Beta(.,.) Beta(.,.) Symmetric. #$: uniform.... Beta(.,.) Beta(,) Beta(,)....8....8 Beta(",#) with "<, # > Beta(",#) with ", # > Strictly decreasing 8 7 Beta(.,.) Beta(.,.) Beta(.,) #, < $ < : strictly concave. #, $ : straight line #, $ > : strictly convex... Beta(,.) Beta(,) Beta(,)....8.....8
Beta as prior for the binomial

Given a prior P(θ | α, β) = Beta(α,β) and data D = (H,T), what is our posterior?

    P(θ | α, β, H, T) ∝ P(H,T | θ) P(θ | α, β)
                      ∝ θ^H (1−θ)^T · θ^(α−1) (1−θ)^(β−1)
                      = θ^(H+α−1) (1−θ)^(T+β−1)

With normalization:

    P(θ | α, β, H, T) = Γ(H+α+T+β) / (Γ(H+α)Γ(T+β)) · θ^(H+α−1) (1−θ)^(T+β−1) = Beta(α+H, β+T)

So, what do we predict?

Our Bayesian estimate for the next coin flip P(x=H | D):

    P(x=H | D) = ∫ P(x=H | θ) P(θ|D) dθ = ∫ θ P(θ|D) dθ = E[θ|D]
               = E[Beta(H+α, T+β)] = (H+α) / (H+α+T+β)

Conjugate priors

The Beta distribution is a conjugate prior to the binomial: the resulting posterior is also a Beta distribution.
All members of the exponential family of distributions have conjugate priors. Examples:
- Multinomial: conjugate prior is the Dirichlet
- Gaussian: conjugate prior is the Gaussian

Multinomials: Dirichlet prior

Multinomial distribution: the probability of observing each possible outcome c_i exactly x_i times in a sequence of n trials:

    P(X_1 = x_1, ..., X_K = x_K) = n! / (x_1! ··· x_K!) · θ_1^(x_1) ··· θ_K^(x_K)   if ∑_(i=1)^K x_i = n

Dirichlet prior:

    Dir(θ | α_1, ..., α_K) = Γ(α_1 + ... + α_K) / (Γ(α_1) ··· Γ(α_K)) · ∏_k θ_k^(α_k − 1)
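In code, the conjugate update is just addition of counts. A minimal sketch (the function names are mine, and the Beta(2,2) prior is an arbitrary illustration):

    def beta_posterior(alpha, beta, H, T):
        # Conjugate update: Beta prior + binomial data -> Beta posterior.
        return alpha + H, beta + T

    def predict_heads(alpha, beta, H, T):
        # P(next flip = heads | D) = E[theta|D] = (H+alpha) / (H+alpha+T+beta)
        a_post, b_post = beta_posterior(alpha, beta, H, T)
        return a_post / (a_post + b_post)

    # D = HTTHTT (2 heads, 4 tails) with a symmetric Beta(2,2) prior:
    print(predict_heads(2, 2, H=2, T=4))   # (2+2) / (2+2+4+2) = 0.4
    # The MLE is H/(H+T) = 0.333...; the prior pulls the estimate toward 1/2.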
More about conjugate priors

- We can interpret the hyperparameters as pseudocounts.
- Sequential estimation (updating the counts after each observation) gives the same results as batch estimation.
- Add-one smoothing (Laplace smoothing) corresponds to a uniform prior.
- On average, more data leads to a sharper posterior (sharper = lower variance).

Todayʼs reading

- Bishop, Pattern Recognition and Machine Learning, Ch. 2
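The first three points can be seen in a few lines of Python (a sketch under my own naming; Beta(1,1) is the uniform prior):

    # Sequential vs. batch updating of a Beta prior, plus add-one smoothing.

    def update(alpha, beta, flip):
        # One sequential step: add the new observation as a pseudocount.
        return (alpha + 1, beta) if flip == "H" else (alpha, beta + 1)

    data = "HTTHTT"

    # Sequential: fold in one observation at a time, starting from Beta(1,1).
    a, b = 1, 1
    for flip in data:
        a, b = update(a, b, flip)

    # Batch: add all counts at once.
    H, T = data.count("H"), data.count("T")
    assert (a, b) == (1 + H, 1 + T)   # same posterior either way

    # Posterior mean under the uniform prior = add-one (Laplace) estimate:
    print(a / (a + b), (H + 1) / (H + T + 2))   # both 0.375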