Conjugacy and the Exponential Family

CS281B/Stat241B: Advanced Topics in Learning & Decision Making

Lecturer: Michael I. Jordan        Scribes: Brian Milch

1 Conjugacy

In the previous lecture, we saw conjugate priors for the multivariate Gaussian distribution. In this lecture, we discuss conjugacy more generally. A family of probability distributions P is conjugate for a probability model p if the posterior lies in P whenever the prior lies in P. Note that the family of all probability distributions is conjugate for any model p.

What if we have a probability model p(y | θ) for some practical problem, but the standard conjugate family for p does not seem to contain a realistic prior distribution? One thing we can do is use a mixture model as our prior:

$$p(\theta) = \sum_{j=1}^{k} \alpha_j \, p_j(\theta)$$

where 0 ≤ α_j ≤ 1, Σ_j α_j = 1, and each p_j(θ) is in a family P_j that is conjugate for p(y | θ). If we multiply this p(θ) by p(y | θ), then each p_j(θ) is multiplied by p(y | θ), yielding another distribution in P_j. So the posterior p(θ | y) is a mixture of the same form.

The same result holds for an infinite mixture indexed by a continuous parameter τ:

$$p(\theta) = \int g(\tau) \, p(\theta; \tau) \, d\tau$$

where g(τ) is an arbitrary density function.

We can also take a limit of increasingly flat conjugate priors, yielding an improper prior with density 1 everywhere. This is the ultimate uninformative prior, but it is improper in that it does not integrate to 1. If we multiply it by a p(y | θ) distribution, such as a Gaussian, the resulting posterior may integrate to 1 and thus be a true density. However, this is not guaranteed for all choices of p(y | θ).

If we use a mixture model, how can we set the α_j's? To be fully Bayesian, we should set them based on some background knowledge. However, a common technique is empirical Bayes: estimate the α_j's by maximum likelihood on the training data.
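To make the mixture-prior update concrete, here is a minimal sketch (my illustration, not part of the original notes) for a Bernoulli likelihood with a mixture of Beta priors: each component's (a, b) parameters are updated in the usual conjugate way, and each mixture weight α_j is reweighted by that component's marginal likelihood of the data, m_j = B(a_j + s, b_j + n − s)/B(a_j, b_j), where s is the number of successes.

```python
import numpy as np
from scipy.special import betaln

def beta_mixture_posterior(alphas, params, data):
    """Posterior of a Beta-mixture prior under a Bernoulli likelihood.

    alphas: mixture weights (sum to 1); params: list of (a, b) Beta
    parameters; data: array of 0/1 observations. Illustrative sketch.
    """
    s, n = data.sum(), len(data)
    # Conjugate update within each component family P_j.
    new_params = [(a + s, b + n - s) for (a, b) in params]
    # Each weight is multiplied by its component's marginal likelihood
    # m_j = B(a + s, b + n - s) / B(a, b), then renormalized.
    log_m = np.array([betaln(a + s, b + n - s) - betaln(a, b)
                      for (a, b) in params])
    w = np.log(alphas) + log_m
    w = np.exp(w - w.max())
    return w / w.sum(), new_params

# A bimodal prior: the coin is probably biased toward heads or toward tails.
weights, params = beta_mixture_posterior(
    np.array([0.5, 0.5]), [(10, 2), (2, 10)], np.array([1, 1, 1, 0, 1]))
print(weights, params)
```

With this bimodal prior, four heads out of five tosses shift most of the posterior weight onto the heads-biased component, while each component remains a Beta distribution, as the argument above guarantees.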

2 The Exponential Family

For a fully general treatment of conjugate priors, we turn to a very large family of distributions called the exponential family, which contains nearly all of the standard parametric families of distributions. The general form of an exponential family distribution is:

$$p(x \mid \theta) = h(x) \exp\left( \phi(\theta)^\top T(x) - A(\theta) \right) \qquad (1)$$

Here φ(θ) is the canonical parameter, often denoted η; T(x) is the sufficient statistic; and A(θ) is the cumulant generating function or log partition function. The x here can range over any set, as long as T(x) maps each x to a vector of fixed, finite dimension. p(x | θ) is a density with respect to some underlying measure µ. The h(x) function can just be thought of as an adjustment to this underlying measure, and is not usually important.

Since the expression in Eq. 1 is a density, it must integrate to 1:

$$\int h(x) \, e^{\eta^\top T(x) - A(\theta)} \, dx = 1$$

where η = φ(θ). Therefore:

$$A(\theta) = \ln \int h(x) \, e^{\eta^\top T(x)} \, dx \qquad (2)$$

So A(θ) is fully determined by η and T(x).

It can be shown that the set of valid η vectors, {η : ∫ h(x) e^{η^⊤ T(x)} dx < ∞}, is convex. So convex combinations of valid parameter vectors are also valid parameter vectors. Also, optimizing over the set of valid η's is not too difficult.

So far we have written the cumulant generating function as A(θ). We can also write it as A(η), using a different function A. We will limit ourselves to cases where φ(θ) is one-to-one.

The second convexity property of the exponential family is that A(η) is always a convex function of η. This follows from a general convex-analysis result about log-sum-exp expressions such as the one in Eq. 2.

A(η) is called the cumulant generating function because its derivatives with respect to η are the cumulants (mean, variance, and so on) of the sufficient statistic T(X). For the first cumulant, we take the first derivative:

$$\nabla A(\eta) = \frac{\int h(x) \, e^{\eta^\top T(x)} \, T(x) \, dx}{\int h(x) \, e^{\eta^\top T(x)} \, dx}$$

By Eq. 2, the denominator is e^{A(η)}, so:

$$\nabla A(\eta) = \int h(x) \, e^{\eta^\top T(x) - A(\eta)} \, T(x) \, dx = E[T(X)]$$

Similarly, the Hessian matrix ∇²A(η) is the covariance matrix Var(T(X)). Thus, to find the cumulants of an exponential family distribution with a given A(η), we don't have to do messy integrals; we just have to take derivatives.

For more information on the exponential family, see the recent technical report by Wainwright and Jordan, "Graphical models, exponential families, and variational inference" (available from Prof. Jordan's web page). A good book on the subject is Lawrence Brown, Fundamentals of Statistical Exponential Families, published in the IMS Lecture Notes series in 1986.

The exponential family is quite powerful: it includes all the standard distributions such as the Bernoulli, Gaussian, gamma, Poisson, Rayleigh, etc. However, exponential family distributions are parametric: the parameter vector η has a fixed dimension.
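As a quick numerical check of the cumulant property (a sketch I have added; not in the original notes): for the Bernoulli in canonical form, A(η) = ln(1 + e^η), and finite differences of A recover the mean σ(η) and variance σ(η)(1 − σ(η)) of T(X) = X without any integration.

```python
import numpy as np

def A(eta):
    """Log partition function of the Bernoulli in canonical form:
    p(x | eta) = exp(eta * x - A(eta)) for x in {0, 1}."""
    return np.log1p(np.exp(eta))

eta, eps = 0.7, 1e-4
mean = (A(eta + eps) - A(eta - eps)) / (2 * eps)           # dA/deta  ~ E[X]
var = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2  # d2A/deta2 ~ Var(X)

mu = 1 / (1 + np.exp(-eta))   # closed-form mean: the sigmoid of eta
print(mean, mu)               # both ~ 0.668
print(var, mu * (1 - mu))     # both ~ 0.222
```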

2.1 Conjugacy and the exponential family

Consider the setup in Fig. 1, where we take n samples y_i from an exponential family distribution:

$$p(y_i \mid \theta) = h(y_i) \exp\left( \phi(\theta)^\top T(y_i) - A(\theta) \right) \qquad (3)$$

Let y be the random vector formed by concatenating all the samples y_i. Then:

$$p(y \mid \theta) = \left( \prod_i h(y_i) \right) \exp\left( \phi(\theta)^\top \sum_i T(y_i) - nA(\theta) \right) \qquad (4)$$

[Figure 1: Graphical model for an exponential family distribution and its conjugate prior. Hyperparameter nodes µ and ν are parents of θ, which is the parent of the plate of observations Y_i, i = 1, …, n.]

Thus, y also has an exponential family distribution: it has the same canonical parameter φ(θ); its sufficient statistic is Σ_i T(y_i); and its cumulant generating function is nA(θ).

We can construct a conjugate family of prior distributions as follows, with two parameters µ and ν:

$$p(\theta \mid \mu, \nu) \propto \exp\left( \phi(\theta)^\top \mu - \nu A(\theta) \right) \qquad (5)$$

Then the posterior distribution is:

$$p(\theta \mid y, \mu, \nu) \propto \exp\left( \phi(\theta)^\top \Big( \mu + \sum_i T(y_i) \Big) - (n + \nu) A(\theta) \right) \qquad (6)$$

We can also take a hierarchical Bayesian approach, putting priors on µ and ν as well.

It is worth noting that this is just the minimal conjugate family for p(y | θ), in the sense that it has a minimal number of parameters. Of course there are other conjugate families, such as the family of mixtures of these distributions, and the family of all distributions.

2.2 ML estimation of mean parameters

Putting aside the Bayesian approach for the moment, suppose we want a maximum likelihood (ML) estimate of the mean parameter µ = E[T(X)] for some exponential family distribution. This µ is a function of η: specifically, µ = ∇_η A(η). So it suffices to maximize the likelihood with respect to η. Based on Eq. 4, the log likelihood is:

$$\ell(\eta; y) = \eta^\top \sum_{i=1}^{n} T(y_i) - nA(\eta) + c(y) \qquad (7)$$

where c(y) is some function that does not depend on η. Differentiating with respect to η, we get:

$$\nabla_\eta \ell = \sum_{i=1}^{n} T(y_i) - n \nabla_\eta A(\eta) = \sum_{i=1}^{n} T(y_i) - n \, E[T(X)]$$

Setting this to zero, we get:

$$\hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^{n} T(y_i) \qquad (8)$$

In other words, the maximum likelihood estimate for the expectation of the sufficient statistic is just the empirical mean of the sufficient statistic. We already knew this fact for common distributions such as the Gaussian, Poisson, etc.; now we have a general proof.
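The posterior update in Eq. 6 is pure bookkeeping on the hyperparameters: µ ← µ + Σ_i T(y_i) and ν ← ν + n. The sketch below (mine, not from the notes) carries this out for the Bernoulli, where φ(θ) = ln(θ/(1−θ)), T(x) = x, and A(θ) = −ln(1−θ); plugging these into Eq. 5 gives a prior proportional to θ^µ (1−θ)^{ν−µ}, i.e., a Beta(µ+1, ν−µ+1) distribution, so the update can be checked against the familiar Beta-Bernoulli rule.

```python
import numpy as np

def conjugate_update(mu, nu, T_values):
    """Eq. 6: add the summed sufficient statistics to mu and the
    sample count to nu. Works for any exponential family."""
    T_values = np.asarray(T_values)
    return mu + T_values.sum(axis=0), nu + len(T_values)

# Bernoulli example: prior (mu=3, nu=10) corresponds to Beta(4, 8).
y = np.array([1, 0, 1, 1, 0, 1])        # T(y_i) = y_i
mu_post, nu_post = conjugate_update(3.0, 10.0, y)
print(mu_post, nu_post)                 # 7.0, 16.0 -> Beta(8, 10)

# Eq. 8: the ML estimate of E[T(X)] is the empirical mean of T.
print(y.mean())                         # mu_hat_ML
```

The result Beta(8, 10) matches the standard Beta-Bernoulli update of Beta(4, 8) with four successes and two failures, as conjugacy requires.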

2.3 Exercises with the exponential family

Here are some exercises for the reader involving the exponential family. Consider the Poisson distribution:

$$p(x \mid \theta) = \frac{\theta^x e^{-\theta}}{x!}$$

and the binomial distribution (with a fixed n):

$$p(x \mid \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x}$$

Express each of these distributions in exponential family form. For example, the canonical parameter η for the binomial distribution is ln(θ/(1−θ)). Compute A(θ), and then differentiate it to obtain the mean µ for each distribution.

2.4 Parameterizations

We have seen several ways of parameterizing the exponential family, that is, indexing the set of exponential family distributions. There is the canonical parameter η, and also the mean parameter µ = E[T(X)]. Recall that A(η) is a convex function, so µ = ∇_η A(η) is nondecreasing in η. If A(η) is strictly convex, then ∇_η A(η) is increasing in η, and therefore the mapping between η and µ is one-to-one. See Fig. 2 for an illustration of this relationship.

[Figure 2: If η is one-dimensional, then the mean parameter µ corresponding to η is just the slope of a tangent line to the cumulant generating function A, evaluated at η.]

We have also used another parameter, denoted θ, to index the set of exponential family distributions. This parameter does not have a special name; it is just some parameter that is convenient for defining the distribution.
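The following sketch (my addition; it also gives away part of the Poisson exercise above) illustrates the one-to-one map of Section 2.4 numerically: for the Poisson, A(η) = e^η so µ = e^η, and for the Bernoulli, A(η) = ln(1 + e^η) so µ = σ(η); each forward map is strictly increasing, and its stated inverse recovers η.

```python
import numpy as np

# Forward maps mu = dA/deta and their inverses, for two families.
poisson = dict(to_mean=np.exp,                    # A(eta) = exp(eta)
               to_canonical=np.log)
bernoulli = dict(to_mean=lambda e: 1 / (1 + np.exp(-e)),  # A = log(1+e^eta)
                 to_canonical=lambda m: np.log(m / (1 - m)))

for name, fam in [("Poisson", poisson), ("Bernoulli", bernoulli)]:
    eta = np.linspace(-2, 2, 5)
    mu = fam["to_mean"](eta)
    # Strict convexity of A makes mu strictly increasing in eta...
    assert np.all(np.diff(mu) > 0)
    # ...and the mapping invertible: to_canonical recovers eta.
    assert np.allclose(fam["to_canonical"](mu), eta)
    print(name, "mean parameters:", np.round(mu, 3))
```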

3 The Exponential Family and Graphical Models

3.1 The Ising model

The previous course in this sequence dealt with probabilistic graphical models. We can also think of graphical models as defining exponential family distributions. As an example, consider the Ising model, illustrated in Fig. 3. This model comes from statistical physics, where each node represents the spin (up or down) of a particle. We represent each particle's spin with a variable X_i taking values in {0, 1} (we can formulate an equivalent model with {−1, 1}). The parameters are θ_i, representing the external field on particle i, and θ_ij, representing the attraction between particles i and j. If i and j are not adjacent in the graph, then θ_ij = 0.

[Figure 3: An Ising model with n = 9 nodes, labeled X_1, X_2, …, X_n.]

The probability distribution is:

$$p(x \mid \theta) = \exp\left( \sum_{i<j} \theta_{ij} x_i x_j + \sum_i \theta_i x_i - A(\theta) \right) = \frac{1}{Z(\theta)} \exp\left( \sum_{i<j} \theta_{ij} x_i x_j + \sum_i \theta_i x_i \right)$$

where Z(θ) is the partition function. This is an exponential family distribution where the sufficient statistic T(x) consists of all the values x_i and x_i x_j (for i < j) concatenated together. So if µ = E[T(X)], then µ_i = E[X_i] and µ_ij = E[X_i X_j]. Thus, the µ vector contains the expectations and correlations of the particles' spins.

It can be shown that A(η) is strictly convex, so there is a one-to-one mapping between η and µ. However, actually computing µ from η is #P-hard. In fact, we can think of the whole problem of probabilistic inference as computing the mean parameter µ from the canonical parameter η (or some other parameter θ). Of course, this problem is #P-hard in general.
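Although computing µ from θ is #P-hard in general, for a handful of nodes we can simply enumerate all 2^n configurations. A minimal brute-force sketch (mine, not from the notes):

```python
import itertools
import numpy as np

def ising_mean_params(theta_i, theta_ij):
    """Brute-force E[X_i] and E[X_i X_j] for an Ising model on {0,1}^n.
    theta_i: length-n field vector; theta_ij: n x n coupling matrix
    (only entries with i < j are used). Exponential in n: demo only."""
    n = len(theta_i)
    configs = np.array(list(itertools.product([0, 1], repeat=n)))
    # Unnormalized log probability of each of the 2^n configurations.
    log_p = configs @ theta_i + np.einsum(
        'ki,kj,ij->k', configs, configs, np.triu(theta_ij, k=1))
    p = np.exp(log_p - log_p.max())
    p /= p.sum()                        # divide by the partition function
    mu_i = p @ configs                  # E[X_i]
    mu_ij = np.einsum('k,ki,kj->ij', p, configs, configs)  # E[X_i X_j]
    return mu_i, mu_ij

# A 3-node chain X1 - X2 - X3 with positive (ferromagnetic) couplings.
theta_i = np.array([0.1, -0.2, 0.3])
theta_ij = np.zeros((3, 3))
theta_ij[0, 1] = theta_ij[1, 2] = 1.0
mu_i, mu_ij = ising_mean_params(theta_i, theta_ij)
print(np.round(mu_i, 3))
```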

3.2 Graphical models in general

In an undirected graphical model, we specify a potential ψ_C for each clique C of the graph. The joint probability distribution is then given by:

$$p(x) = \frac{1}{Z} \prod_C \psi_C(x_C) \qquad (9)$$

where Z is some normalization constant. We can write this in exponential family form as:

$$p(x) = \exp\left( \sum_C \ln \psi_C(x_C) - \ln Z \right) \qquad (10)$$

If all the variables in the model have discrete values, then for each clique C, we can define an indicator vector with an entry for each configuration of the variables in C. The sufficient statistic T(x) is formed by concatenating the indicator vectors for all the cliques. In general, if each ψ_C is an exponential family distribution, we can form T(x) by concatenating the sufficient statistics for these distributions.

In a directed graphical model, we specify a conditional distribution for each variable given its parents in the graph. The joint distribution is:

$$p(x) = \prod_i p(x_i \mid x_{\pi(i)}) \qquad (11)$$

The exponential family form of this distribution is simple:

$$p(x) = \exp\left( \sum_i \ln p(x_i \mid x_{\pi(i)}) \right) \qquad (12)$$

If each p(x_i | x_π(i)) is an exponential family distribution, then it is clear that the distribution in Eq. 12 is also in the exponential family.
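To close, a tiny numerical illustration of Eqs. 11-12 (my sketch, with made-up conditional probability tables): for a two-node network X_1 → X_2 with Bernoulli conditionals, the joint is the exponential of the sum of the log conditionals, and it sums to 1 over all configurations.

```python
import numpy as np

# X1 -> X2, both binary. p(X1) and p(X2 | X1) as lookup tables.
p_x1 = np.array([0.7, 0.3])            # p(X1=0), p(X1=1)
p_x2_given_x1 = np.array([[0.9, 0.1],  # row x1: p(X2=0|x1), p(X2=1|x1)
                          [0.2, 0.8]])

def joint(x1, x2):
    # Eq. 12: exponentiate the sum of the log conditionals.
    return np.exp(np.log(p_x1[x1]) + np.log(p_x2_given_x1[x1, x2]))

# Sums to 1 over all configurations, as a joint distribution must.
print(sum(joint(a, b) for a in (0, 1) for b in (0, 1)))  # 1.0
print(joint(1, 1))  # p(X1=1) * p(X2=1|X1=1) = 0.3 * 0.8 = 0.24
```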