Exponential Families

David M. Blei

1 Introduction

We discuss the exponential family, a very flexible family of distributions. Most distributions that you have heard of are in the exponential family: Bernoulli, Gaussian, Multinomial, Dirichlet, Gamma, Poisson, Beta.

2 Set-up

An exponential family distribution has the following form,

    p(x | η) = h(x) exp{η^T t(x) − a(η)}.    (1)

The different parts of this equation are:

- The natural parameter η.
- The sufficient statistic t(x).
- The underlying measure h(x), e.g., counting measure or Lebesgue measure.
- The log normalizer a(η),

    a(η) = log ∫ h(x) exp{η^T t(x)} dx.    (2)

Here we integrate the unnormalized density over the sample space. This ensures that the density integrates to one. The statistic t(x) is called sufficient because the likelihood for η depends on x only through t(x).

The exponential family has fundamental connections to the world of graphical models. For our purposes, we will use exponential families as components in directed graphical models, e.g., in mixtures of Gaussians.
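For instance, the Bernoulli distribution fits Equation (1) with η = log(p/(1−p)), t(x) = x, h(x) = 1, and a(η) = log(1 + e^η). The sketch below (my own illustration in Python/NumPy, not part of the notes) checks that this form reproduces the familiar pmf p^x (1 − p)^(1−x):

```python
import numpy as np

def bernoulli_exp_family(x, eta):
    """p(x | eta) = h(x) exp{eta * t(x) - a(eta)} with t(x) = x and h(x) = 1."""
    a = np.log1p(np.exp(eta))          # log normalizer a(eta) = log(1 + e^eta)
    return np.exp(eta * x - a)

p = 0.3
eta = np.log(p / (1 - p))              # natural parameter: the log-odds
for x in (0, 1):
    print(x, bernoulli_exp_family(x, eta), p**x * (1 - p)**(1 - x))
```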

3 The Gaussian distribution

As a running example, consider the Gaussian distribution. The familiar form of the univariate Gaussian is

    p(x | µ, σ^2) = (1 / √(2πσ^2)) exp{−(x − µ)^2 / (2σ^2)}.    (3)

We put it in exponential family form by expanding the square,

    p(x | µ, σ^2) = (1 / √(2π)) exp{(µ/σ^2) x − (1/(2σ^2)) x^2 − (1/(2σ^2)) µ^2 − (1/2) log σ^2}.    (4)

We see that

    η = ⟨µ/σ^2, −1/(2σ^2)⟩    (5)
    t(x) = ⟨x, x^2⟩    (6)
    a(η) = µ^2/(2σ^2) + (1/2) log σ^2    (7)
         = −η_1^2/(4η_2) − (1/2) log(−2η_2)    (8)
    h(x) = 1/√(2π).    (9)

If you are new to this, work it out for the other distributions on the list.
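To check Equations (3)-(9), the following sketch (my own, not from the notes; Python with NumPy and SciPy assumed) maps (µ, σ^2) to the natural parameters, evaluates h(x) exp{η^T t(x) − a(η)}, and compares the result to the standard Gaussian density:

```python
import numpy as np
from scipy.stats import norm

def natural_params(mu, sigma2):
    """Equation (5): eta = (mu / sigma^2, -1 / (2 sigma^2))."""
    return np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def log_normalizer(eta):
    """Equation (8): a(eta) = -eta_1^2 / (4 eta_2) - (1/2) log(-2 eta_2)."""
    return -eta[0]**2 / (4.0 * eta[1]) - 0.5 * np.log(-2.0 * eta[1])

def exp_family_pdf(x, eta):
    """Equation (1) with t(x) = (x, x^2) and h(x) = 1 / sqrt(2 pi)."""
    t = np.array([x, x**2])
    return np.exp(eta @ t - log_normalizer(eta)) / np.sqrt(2 * np.pi)

mu, sigma2 = 1.5, 0.7
eta = natural_params(mu, sigma2)
print(exp_family_pdf(0.4, eta), norm.pdf(0.4, loc=mu, scale=np.sqrt(sigma2)))
```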

4 Moments

The derivatives of the log normalizer give the moments of the sufficient statistics,

    (d/dη) a(η) = (d/dη) log ∫ exp{η^T t(x)} h(x) dx    (10)
                = ∫ t(x) exp{η^T t(x)} h(x) dx / ∫ exp{η^T t(x)} h(x) dx    (11)
                = ∫ t(x) exp{η^T t(x) − a(η)} h(x) dx    (12)
                = E[t(X)].    (13)

Higher derivatives give higher moments; the second derivative gives the variance, and so on. Let's go back to the Gaussian example. The derivative with respect to η_1 is

    ∂a(η)/∂η_1 = −η_1/(2η_2)    (14)
               = µ    (15)
               = E[X].    (16)

The derivative with respect to η_2 is

    ∂a(η)/∂η_2 = η_1^2/(4η_2^2) − 1/(2η_2)    (17)
               = σ^2 + µ^2    (18)
               = E[X^2].    (19)

This means that the variance is

    Var(X) = E[X^2] − E[X]^2    (20)
           = −1/(2η_2).    (21)

In a minimal exponential family, the components of the sufficient statistic t(x) are linearly independent. In a minimal exponential family, the mean µ := E[t(X)] is another parameterization of the distribution. That is, there is a 1-1 mapping between η and µ. The function a(η) is convex. (It is of log-sum-exp form.) Thus there is a 1-1 mapping between its argument and its derivative, and so there is a 1-1 mapping between η and E[t(X)].

Side note: the MLE of an exponential family matches the mean parameters to the empirical statistics of the data. Assume x_{1:n} are drawn from an exponential family and find the η̂ that maximizes the likelihood of x_{1:n}. This is the η such that E[t(X)] = (1/n) Σ_i t(x_i).
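The moment identities in Equations (14)-(21) are easy to check numerically. The sketch below (my own, not from the notes; Python/NumPy assumed) takes finite-difference derivatives of a(η) for the Gaussian above and compares them with E[X] = µ, E[X^2] = σ^2 + µ^2, and Var(X) = −1/(2η_2); the last two lines illustrate the moment-matching side note with simulated data.

```python
import numpy as np

def log_normalizer(eta1, eta2):
    """a(eta) for the univariate Gaussian, Equation (8)."""
    return -eta1**2 / (4.0 * eta2) - 0.5 * np.log(-2.0 * eta2)

mu, sigma2 = 1.5, 0.7
eta1, eta2 = mu / sigma2, -1.0 / (2.0 * sigma2)

# Finite-difference derivatives of a(eta).
eps = 1e-6
da_deta1 = (log_normalizer(eta1 + eps, eta2) - log_normalizer(eta1 - eps, eta2)) / (2 * eps)
da_deta2 = (log_normalizer(eta1, eta2 + eps) - log_normalizer(eta1, eta2 - eps)) / (2 * eps)

print(da_deta1, mu)                  # Equation (16): E[X] = mu
print(da_deta2, sigma2 + mu**2)      # Equation (19): E[X^2] = sigma^2 + mu^2
print(-1.0 / (2.0 * eta2), sigma2)   # Equation (21): Var(X) = -1/(2 eta_2)

# Side note in action: the MLE matches mean parameters to empirical statistics.
rng = np.random.default_rng(0)
x = rng.normal(mu, np.sqrt(sigma2), size=100_000)
print(x.mean(), (x**2).mean())       # approximately mu and sigma^2 + mu^2
```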

5 Conjugacy

Consider the following set-up:

    η ∼ F(· | λ)    (22)
    x_i ∼ G(· | η) for i ∈ {1, ..., n}.    (23)

This is a classical Bayesian data analysis setting. It is also used as a component in more complicated models, e.g., in hierarchical models. The posterior distribution of η given the data x_{1:n} is

    p(η | x_{1:n}, λ) ∝ F(η | λ) ∏_{i=1}^n G(x_i | η).    (24)

When this distribution is in the same family as F, i.e., when its parameters are part of the parameter space defined by λ, then we say that F and G form a conjugate pair. For example:

- A Gaussian likelihood with fixed variance, and a Gaussian prior on the mean.
- A multinomial likelihood and a Dirichlet prior on the probabilities.
- A Bernoulli likelihood and a beta prior on the bias.
- A Poisson likelihood and a gamma prior on the rate.

In all these settings, the conditional distribution of the parameter given the data is in the same family as the prior.
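To make one of these pairs concrete, here is a minimal sketch (my own illustration, not from the notes; Python/NumPy assumed) of the Bernoulli/beta case: with a Beta(α, β) prior on the bias and n observed flips, the posterior is Beta(α + Σ_i x_i, β + n − Σ_i x_i). The same pattern, add sufficient statistics and add counts, is what the general exponential-family derivation below produces.

```python
import numpy as np

def beta_bernoulli_posterior(alpha, beta, x):
    """Conjugate update for a Beta(alpha, beta) prior on the bias of Bernoulli data."""
    x = np.asarray(x)
    return alpha + x.sum(), beta + len(x) - x.sum()

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=50)             # 50 flips of a coin with bias 0.3
alpha_post, beta_post = beta_bernoulli_posterior(2.0, 2.0, x)
print(alpha_post, beta_post)                  # the posterior is still a beta distribution
print(alpha_post / (alpha_post + beta_post))  # posterior mean of the bias
```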

Suppose the data come from an exponential family. Every exponential family has a conjugate prior (in theory),

    p(x_i | η) = h_l(x_i) exp{η^T t(x_i) − a_l(η)}    (25)
    p(η | λ) = h_c(η) exp{λ_1^T η + λ_2 (−a_l(η)) − a_c(λ)}.    (26)

The natural parameter λ = ⟨λ_1, λ_2⟩ has dimension dim(η) + 1. The sufficient statistics are ⟨η, −a_l(η)⟩. The other terms depend on the form of the exponential family. For example, when η are multinomial parameters then the other terms help define a Dirichlet.

Let's compute the posterior,

    p(η | x_{1:n}, λ) ∝ p(η | λ) ∏_{i=1}^n p(x_i | η)    (27)
                      = h_c(η) exp{λ_1^T η + λ_2 (−a_l(η)) − a_c(λ)}    (28)
                        × (∏_{i=1}^n h_l(x_i)) exp{η^T Σ_{i=1}^n t(x_i) − n a_l(η)}    (29)
                      ∝ h_c(η) exp{(λ_1 + Σ_{i=1}^n t(x_i))^T η + (λ_2 + n)(−a_l(η))}.    (30)

This is the same exponential family as the prior, with parameters

    λ̂_1 = λ_1 + Σ_{i=1}^n t(x_i)    (31)
    λ̂_2 = λ_2 + n.    (32)

6 Example: Data from a unit variance Gaussian

Suppose the data x_i come from a unit variance Gaussian,

    p(x | µ) = (1/√(2π)) exp{−(x − µ)^2 / 2}.    (33)

This is a simpler exponential family than the previous Gaussian,

    p(x | µ) = (exp{−x^2/2} / √(2π)) exp{µx − µ^2/2}.    (34)

In this case

    η = µ    (35)
    t(x) = x    (36)
    h(x) = exp{−x^2/2} / √(2π)    (37)
    a(η) = µ^2/2 = η^2/2.    (38)

We are interested in the conjugate prior. Consider a model with an unknown mean. What is the conjugate prior? It is

    p(η | λ) = h_c(η) exp{λ_1 η + λ_2 (−η^2/2) − a_c(λ)}.    (39)

Setting λ'_1 = λ_1 and λ'_2 = λ_2/2, the exponent is λ'_1 η − λ'_2 η^2, so the sufficient statistics are ⟨η, η^2⟩. This is a Gaussian distribution on η. We now know the conjugate prior. Now consider the posterior,

    λ̂_1 = λ_1 + Σ_{i=1}^n x_i    (40)
    λ̂_2 = λ_2 + n    (41)
    λ̂'_2 = (λ_2 + n)/2.    (42)
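Equations (40)-(41) just add the summed sufficient statistics to λ_1 and the number of observations to λ_2. A minimal sketch (my own, not from the notes; Python/NumPy assumed), using a prior with λ_1 = 0 and λ_2 = 1, i.e., prior mean 0 and prior variance 1:

```python
import numpy as np

def conjugate_update(lambda1, lambda2, suff_stats):
    """Equations (31)-(32): add the summed sufficient statistics and the count n."""
    suff_stats = np.asarray(suff_stats)
    return lambda1 + suff_stats.sum(axis=0), lambda2 + len(suff_stats)

# Unit-variance Gaussian: t(x) = x, so the sufficient statistics are the data themselves.
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=100)
lambda1_hat, lambda2_hat = conjugate_update(lambda1=0.0, lambda2=1.0, suff_stats=x)
print(lambda1_hat, lambda2_hat)
```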

Let's map this back to traditional Gaussian parameters. The mean is

    E[µ | x_{1:n}, λ] = (λ_1 + Σ_{i=1}^n x_i) / (λ_2 + n).    (43)

The variance is

    Var(µ | x_{1:n}, λ) = 1 / (λ_2 + n).    (44)

Finally, for closure, let's write everything in the mean parameterization. Consider a prior mean and prior variance {µ_0, σ_0^2}. We know that

    λ_1 = µ_0 / σ_0^2    (45)
    λ_2 = 1 / σ_0^2    (46)
    λ'_2 = 1 / (2σ_0^2).    (47)

The quantity λ_2 = 1/σ_0^2 is also called the precision. So the posterior mean is

    E[µ | x_{1:n}, µ_0, σ_0^2] = (µ_0/σ_0^2 + Σ_{i=1}^n x_i) / (1/σ_0^2 + n).    (48)

The posterior variance is

    Var(µ | x_{1:n}, µ_0, σ_0^2) = 1 / (1/σ_0^2 + n).    (49)

Intuitively, when we haven't seen any data, our estimate of the mean is the prior mean. As we see more data, our estimate moves towards the sample mean. Before seeing data, our uncertainty about the estimate is the prior variance; as we see more data, the posterior variance shrinks and our confidence grows.
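As a final numerical check (my own sketch in Python/NumPy, not part of the notes), the traditional-parameter formulas in Equations (48)-(49) agree with the natural-parameter update of Equations (40)-(41) followed by Equations (43)-(44):

```python
import numpy as np

def posterior_mean_var(mu0, sigma2_0, x):
    """Equations (48)-(49): posterior over the mean of a unit-variance Gaussian."""
    x = np.asarray(x)
    precision_post = 1.0 / sigma2_0 + len(x)
    mean_post = (mu0 / sigma2_0 + x.sum()) / precision_post
    return mean_post, 1.0 / precision_post

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.0, size=100)
mu0, sigma2_0 = 0.0, 10.0

mean, var = posterior_mean_var(mu0, sigma2_0, x)

# Same computation via natural parameters: Equations (45)-(46), (40)-(41), (43)-(44).
lam1, lam2 = mu0 / sigma2_0, 1.0 / sigma2_0
lam1_hat, lam2_hat = lam1 + x.sum(), lam2 + len(x)
print(mean, lam1_hat / lam2_hat)   # should match
print(var, 1.0 / lam2_hat)         # should match
```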