Introduction: exponential family, conjugacy, and sufficiency (9/2/13)

STA 561: Probabilistic machine learning
Lecturer: Barbara Engelhardt
Scribes: Melissa Dalis, Abhinandan Nath, Abhishek Dubey, Xin Zhou

1 Review

In the previous class, we discussed the maximum likelihood estimate (MLE) and the maximum a posteriori (MAP) estimate. In general, they are computed as

$$\hat{h}_{MAP} = \arg\max_{h \in H} P(h \mid D), \qquad \hat{h}_{MLE} = \arg\max_{h \in H} P(D \mid h).$$

1.1 How to calculate the MLE

Assume we have the observed data sequence $D = \{x_1, x_2, \ldots, x_n\}$, with the $X_i$ independent and identically distributed (IID) and $X_i \in \{0, 1\}$. We model the $X_i$'s as Bernoulli, $X_i \sim \mathrm{Ber}(\pi)$, with parameter $\pi \in [0, 1]$, where $\pi$ is the probability that $X_i$ equals 1. The log likelihood for $D$ is

$$\ell(h; D) = \log p(D \mid h) = \sum_{i=1}^{n} \log \pi^{x_i} (1 - \pi)^{1 - x_i} = \sum_{i=1}^{n} \left[ x_i \log \pi + (1 - x_i) \log(1 - \pi) \right].$$

We work with the logarithm of the likelihood because it makes the first derivative easier to compute, and because the logarithm is monotonically increasing, so maximizing $\log f(x)$ also maximizes $f(x)$. The log likelihood is a concave function of $\pi$ (see Figure 1): it first increases as $\pi$ increases, reaches a maximum, and then decreases. The global maximum identifies the value of $\pi$ that maximizes the log likelihood of the data, and it occurs where the first derivative of the log likelihood with respect to $\pi$ equals 0 (i.e., where the slope of the log likelihood is zero).

1.2 Example: calculating the MLE for the Bernoulli distribution

To find the maximum of the concave function above, we take the first derivative of $\ell(h; D)$ with respect to $\pi$:

$$\frac{\partial \ell(h; D)}{\partial \pi} = \frac{\sum_{i=1}^{n} x_i}{\pi} - \frac{n - \sum_{i=1}^{n} x_i}{1 - \pi}.$$

Figure 1: Example of a concave function, depicting the global maximum at the position where the slope is zero.

Setting this derivative to 0, we get

$$\frac{N_1}{\pi} - \frac{N_0}{1 - \pi} = 0$$
$$(1 - \pi) N_1 = \pi N_0$$
$$N_1 = (N_1 + N_0) \pi$$
$$\hat{\pi}_{MLE} = \frac{N_1}{N_1 + N_0},$$

where we use $N_1$ and $N_0$ as shorthand for the number of heads and the number of tails in the data set:

$$N_1 = \sum_{i=1}^{n} x_i, \qquad N_0 = \sum_{i=1}^{n} (1 - x_i).$$
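As a sanity check on this closed form, the following is a minimal Python sketch (an addition for illustration, not part of the original notes) that compares the analytical MLE $N_1 / (N_1 + N_0)$ against a direct grid maximization of the Bernoulli log likelihood on simulated coin flips; the simulated data, the true parameter 0.3, and the grid resolution are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)      # simulated IID coin flips, true pi = 0.3

# Closed-form MLE: fraction of heads, N1 / (N1 + N0)
pi_mle = x.mean()

# Numerical check: maximize the Bernoulli log likelihood over a grid of pi values
grid = np.linspace(1e-3, 1 - 1e-3, 10_000)
loglik = x.sum() * np.log(grid) + (len(x) - x.sum()) * np.log(1 - grid)
pi_grid = grid[np.argmax(loglik)]

print(pi_mle, pi_grid)                   # the two estimates agree up to grid resolution
```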

2 The Exponential Family

2.1 Why exponential family?

- The exponential family is the only family of distributions with finite-sized sufficient statistics, meaning that we can compress potentially very large amounts of data into a constant-sized summary without loss of information. This is particularly useful for online learning, where the observed data may become huge (e.g., your email inbox, where each email is a sample and emails arrive in an ongoing way).
- The exponential family is the only family of distributions for which conjugate priors exist, which simplifies the computation of the posterior.
- Exponential family distributions are the core of generalized linear models and variational methods, which we will learn about in this course.
- Expectations are simple to compute, as we will see today, making our life simpler. This is part of the exponential family machinery that we can exploit to make the computation and mathematics simpler.
- The exponential family includes many of the distributions we have seen already: normal, exponential, gamma, beta, Dirichlet, Bernoulli, Poisson, and many others. An important distribution that does not strictly belong to the exponential family is the uniform distribution.

2.2 Definition

A distribution is in the exponential family if its pdf or pmf $p(x \mid \eta)$, for $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$ and $\eta \in H \subseteq \mathbb{R}^d$, can be written in the following form:

$$p(x \mid \eta) = h(x) \exp\left\{ \eta^T T(x) - A(\eta) \right\},$$

where:

- $\eta$: natural parameters
- $A(\eta)$: log partition function
- $T(x)$: sufficient statistics
- $h(x)$: base measure

Figure 2: Link and response functions. The link function maps the mean parameter $\mu$ to the natural parameter $\eta$; the response function maps $\eta$ back to $\mu$.

The relationship between $\mu$ and $\eta$ is shown in Figure 2: the link function and the response function are invertible functions that map between the mean parameters and the natural parameters. This mapping is one-to-one (the reason will become clear later). This fact enables us to work in either the space of natural parameters or the space of mean parameters, whichever is mathematically more convenient, since converting between them is straightforward.
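As a concrete illustration of this mapping (a small sketch added here, not part of the original notes): for the Bernoulli distribution, derived in the next subsection, the link function is the logit and the response function is the logistic, and applying one after the other returns the original mean parameter.

```python
import numpy as np

def link(pi):
    """Logit: map the Bernoulli mean parameter pi to the natural parameter eta."""
    return np.log(pi / (1 - pi))

def response(eta):
    """Logistic: map the natural parameter eta back to the mean parameter pi."""
    return 1.0 / (1.0 + np.exp(-eta))

pi = np.array([0.1, 0.5, 0.9])
eta = link(pi)
print(eta)              # natural parameters corresponding to each mean parameter
print(response(eta))    # recovers [0.1, 0.5, 0.9]: the mapping is one-to-one
```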

2.3 Examples

In this section, we write the Bernoulli and the Gaussian distributions in exponential family form. We will also derive the link and response functions.

2.3.1 Bernoulli distribution in the exponential family

We write the probability mass function of a Bernoulli random variable $x \sim \mathrm{Ber}(\pi)$ in exponential family form as below, where $\pi$ is the mean parameter of the random variable $X$, e.g., the probability of a single coin flip coming up heads. The general way to derive this is to take $\exp(\log(\cdot))$ of the pmf or pdf and rearrange the resulting terms to match the exponential family form:

$$\mathrm{Ber}(x \mid \pi) = \pi^{x} (1 - \pi)^{1 - x} = \exp\left[ \log\left( \pi^{x} (1 - \pi)^{1 - x} \right) \right] = \exp\left[ x \log \pi + (1 - x) \log(1 - \pi) \right]$$
$$= \exp\left[ x \left( \log \pi - \log(1 - \pi) \right) + \log(1 - \pi) \right] = \exp\left[ x \log \frac{\pi}{1 - \pi} + \log(1 - \pi) \right].$$

Comparing the above with the exponential family form, we have

$$h(x) = 1, \qquad T(x) = x, \qquad \eta = \log \frac{\pi}{1 - \pi}, \qquad A(\eta) = -\log(1 - \pi) = \log(1 + \exp(\eta)).$$

Converting between $\eta$ and $\mu$: the logit function, $\eta = \log \frac{\pi}{1 - \pi}$, computes the natural parameter $\eta$ from the mean parameter $\pi$. Its inverse, the logistic function, $\pi = \frac{1}{1 + \exp(-\eta)}$, computes the mean parameter from the natural parameter. To see that $A(\eta) = \log(1 + e^{\eta})$ and to recover the logistic function, note that

$$A(\eta) = \log(1 + e^{\eta}) = -\log(1 - \pi) \;\Longleftrightarrow\; 1 + e^{\eta} = \frac{1}{1 - \pi} \;\Longleftrightarrow\; \pi = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}.$$
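As a quick numerical check of this rewriting (an added sketch, not part of the original notes), the following Python snippet evaluates $h(x)\exp\{\eta T(x) - A(\eta)\}$ for the Bernoulli and confirms that it matches $\pi^{x}(1 - \pi)^{1 - x}$ for a few values of $x$ and $\pi$.

```python
import numpy as np

def bernoulli_pmf(x, pi):
    """Standard Bernoulli pmf, pi^x (1 - pi)^(1 - x)."""
    return pi**x * (1 - pi)**(1 - x)

def bernoulli_expfam(x, pi):
    """Bernoulli pmf written as h(x) exp(eta * T(x) - A(eta))."""
    eta = np.log(pi / (1 - pi))       # natural parameter (logit of pi)
    A = np.log(1 + np.exp(eta))       # log partition function
    h, T = 1.0, x                     # base measure and sufficient statistic
    return h * np.exp(eta * T - A)

for pi in (0.2, 0.7):
    for x in (0, 1):
        assert np.isclose(bernoulli_pmf(x, pi), bernoulli_expfam(x, pi))
```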

2.3.2 Gaussian distribution in the exponential family

The Gaussian probability density function can be written in exponential family form as follows:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2 \pi \sigma^2)^{1/2}} \exp\left[ -\frac{1}{2 \sigma^2} (x - \mu)^2 \right]$$
$$= \frac{1}{(2 \pi \sigma^2)^{1/2}} \exp\left[ -\frac{1}{2 \sigma^2} x^2 + \frac{\mu}{\sigma^2} x - \frac{1}{2 \sigma^2} \mu^2 \right]$$
$$= \frac{1}{(2 \pi)^{1/2}} \exp\left[ -\log \sigma - \frac{1}{2 \sigma^2} \mu^2 + \frac{\mu}{\sigma^2} x - \frac{1}{2 \sigma^2} x^2 \right].$$

Comparing the above with the exponential family form, we have

$$h(x) = \frac{1}{(2 \pi)^{1/2}}, \qquad \eta = \begin{pmatrix} \mu / \sigma^2 \\ -\frac{1}{2 \sigma^2} \end{pmatrix}, \qquad T(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}, \qquad A(\eta) = \log \sigma + \frac{\mu^2}{2 \sigma^2} = -\frac{\eta_1^2}{4 \eta_2} - \frac{1}{2} \log(-2 \eta_2).$$

2.4 Properties of the exponential family

For the exponential family we have

$$\int h(x) \exp\left\{ \eta^T T(x) - A(\eta) \right\} dx = 1,$$

since the integrand is a probability distribution, which implies

$$A(\eta) = \log \int h(x) \exp\left\{ \eta^T T(x) \right\} dx.$$

We will use this property below to compute the mean and variance of a distribution in the exponential family. For this reason, the function $A(\eta)$ is often called the log normalizer.

2.4.1 Expected value

In the exponential family, we can take derivatives of the log partition function to obtain the cumulants of the sufficient statistics; this is why $A(\eta)$ is also called a cumulant function. Below we show how to calculate the first and second cumulants of a distribution, which are, in this case, the mean $\mathbb{E}[\cdot]$ and variance $\mathrm{var}[\cdot]$ of the sufficient statistics. The first derivative of $A(\eta)$ is the expected value:

$$\frac{\partial A(\eta)}{\partial \eta} = \frac{\partial}{\partial \eta} \log \int h(x) \exp\left\{ \eta^T T(x) \right\} dx = \frac{\int h(x) \exp\left\{ \eta^T T(x) \right\} T(x) \, dx}{\exp(A(\eta))}$$
$$= \int h(x) \exp\left\{ \eta^T T(x) - A(\eta) \right\} T(x) \, dx = \int p(x \mid \eta) \, T(x) \, dx = \mathbb{E}_{\eta}[T(x)].$$
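A minimal numerical sketch (added here for illustration, not part of the original notes) of this identity for the Gaussian parameterization just derived: differentiating $A(\eta)$ with central finite differences should recover $\mathbb{E}[T(x)] = (\mathbb{E}[x], \mathbb{E}[x^2]) = (\mu, \mu^2 + \sigma^2)$. The particular values of $\mu$ and $\sigma^2$ are arbitrary.

```python
import numpy as np

mu, sigma2 = 1.5, 0.8
eta = np.array([mu / sigma2, -1.0 / (2 * sigma2)])   # Gaussian natural parameters

def A(eta):
    """Gaussian log partition function: -eta1^2 / (4 eta2) - 0.5 log(-2 eta2)."""
    return -eta[0]**2 / (4 * eta[1]) - 0.5 * np.log(-2 * eta[1])

# Gradient of A via central finite differences
eps = 1e-6
grad = np.array([(A(eta + eps * np.eye(2)[k]) - A(eta - eps * np.eye(2)[k])) / (2 * eps)
                 for k in range(2)])

print(grad)                    # approximately [1.5, 3.05]
print([mu, mu**2 + sigma2])    # E[x] = mu and E[x^2] = mu^2 + sigma^2
```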

2.4.2 Variance

It is also simple to calculate the variance, which is equal to the second derivative of the log partition function with respect to the natural parameter:

$$\frac{\partial^2 A(\eta)}{\partial \eta \, \partial \eta^T} = \int h(x) \exp\left\{ \eta^T T(x) - A(\eta) \right\} T(x) \left( T(x) - \frac{\partial A(\eta)}{\partial \eta} \right)^T dx$$
$$= \int p(x \mid \eta) \, T(x) \left( T(x) - A'(\eta) \right) dx = \int p(x \mid \eta) \, T^2(x) \, dx - A'(\eta) \int p(x \mid \eta) \, T(x) \, dx$$
$$= \mathbb{E}[T^2(x)] - \mathbb{E}^2[T(x)] = \mathrm{var}[T(x)].$$

Example: the Bernoulli distribution. Let $X$ be a Bernoulli random variable with probability $\pi$. Then $A(\eta) = \log(1 + e^{\eta})$ and $T(x) = x$, as shown in a previous section. As explained above, we can find the expectation and variance by taking the first and second derivatives of $A(\eta)$:

$$\frac{\partial A(\eta)}{\partial \eta} = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}} = \pi(\eta) = \mathbb{E}[x] = \mathbb{E}[T(x)]$$
$$\frac{\partial^2 A(\eta)}{\partial \eta^2} = \frac{e^{\eta}}{(1 + e^{\eta})^2} = \frac{1}{1 + e^{\eta}} \cdot \frac{e^{\eta}}{1 + e^{\eta}} = (1 - \pi) \pi = \mathrm{var}[T(x)].$$

This property of the log partition function generalizes: the $m$-th derivative of $A(\eta)$ is the $m$-th cumulant of the sufficient statistics, so the problem of estimating moments (integration) is reduced to differentiation, which makes the computation much easier. Also, $A(\eta)$ is a convex function of $\eta$, since its second derivative is $\mathrm{var}[T(x)]$, which is always positive semidefinite. When $\mathrm{var}[T(x)]$ is positive definite, $A(\eta)$ is strictly convex, and we are guaranteed that $\nabla A(\eta)$ is one-to-one; that is, $\mu = \mathbb{E}[T(x)] = \nabla A(\eta)$ is invertible, giving a one-to-one mapping between the mean parameter and the natural parameter.
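The Bernoulli case above can also be checked numerically; the sketch below (an addition, not from the original notes) differentiates $A(\eta) = \log(1 + e^{\eta})$ with central finite differences and compares the results against $\pi$ and $\pi(1 - \pi)$. The choice $\pi = 0.3$ is arbitrary.

```python
import numpy as np

def A(eta):
    """Bernoulli log partition function."""
    return np.log(1 + np.exp(eta))

pi = 0.3
eta = np.log(pi / (1 - pi))    # natural parameter corresponding to pi
eps = 1e-5

dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)              # first derivative
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2   # second derivative

print(dA, pi)                  # mean of T(x): approximately 0.3
print(d2A, pi * (1 - pi))      # variance of T(x): approximately 0.21
```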

3 MLE for the exponential family

A nice property of the exponential family is that it is closed under sampling: the sufficient statistics $T(x)$ are of finite size, independent of the size of the data set, i.e., the size of $T(x)$ does not grow with $n = |D|$. To see why, consider a sequence of observations $X = \{x_1, x_2, \ldots, x_n\}$ (all $x_i$'s i.i.d.) and write the likelihood in exponential family form:

$$p(x_1, x_2, \ldots, x_n \mid \eta) = \prod_{i=1}^{n} p(x_i \mid \eta) = \left( \prod_{i=1}^{n} h(x_i) \right) \exp\left\{ \eta^T \sum_{i=1}^{n} T(x_i) - n A(\eta) \right\}.$$

The sufficient statistics are thus $\{n, \sum_{i=1}^{n} T_j(x_i)\}$ for $j \in \{1, \ldots, |T(x)|\}$, which has exactly $|T(x)| + 1$ components. For example, the sufficient statistics for the Bernoulli distribution are $\{\sum_{i=1}^{n} x_i, \, n\}$, and the sufficient statistics for the Gaussian distribution are $\{\sum_{i=1}^{n} x_i, \, \sum_{i=1}^{n} x_i^2, \, n\}$.

3.1 Computing the MLE

We write the log likelihood in exponential family form:

$$\ell(\eta; D) = \log p(D \mid \eta) = \sum_{i=1}^{n} \log h(x_i) + \eta^T \sum_{i=1}^{n} T(x_i) - n A(\eta).$$

This is a concave function of $\eta$ and hence has a global maximum, which can be found by setting the derivative of the log likelihood with respect to the natural parameter $\eta$ to 0:

$$\nabla_{\eta} \, \ell(\eta; x_1, \ldots, x_n) = 0$$
$$\sum_{i=1}^{n} T(x_i) - n \, \nabla_{\eta} A(\eta) = 0$$
$$\mathbb{E}_{\eta_{MLE}}[T(x)] = \frac{1}{n} \sum_{i=1}^{n} T(x_i)$$
$$\mu(\eta_{MLE}) = \frac{1}{n} \sum_{i=1}^{n} T(x_i).$$

We see that at the MLE the expected sufficient statistics under the model are equal to the empirical average of the sufficient statistics.
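To make this moment-matching characterization concrete, here is a small Python sketch (added for illustration, not part of the original notes) for the Gaussian case: equating the model's expected sufficient statistics $(\mathbb{E}[x], \mathbb{E}[x^2]) = (\mu, \mu^2 + \sigma^2)$ to their empirical averages recovers the familiar estimates $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2$. The simulated data and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=5000)   # simulated data, true (mu, sigma^2) = (2.0, 2.25)

# Empirical averages of the Gaussian sufficient statistics T(x) = (x, x^2)
t1 = x.mean()        # (1/n) sum x_i
t2 = (x**2).mean()   # (1/n) sum x_i^2

# Moment matching: E[x] = mu and E[x^2] = mu^2 + sigma^2
mu_mle = t1
sigma2_mle = t2 - t1**2

print(mu_mle, sigma2_mle)    # close to (2.0, 2.25)
print(x.mean(), x.var())     # identical values computed directly
```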

4 Conjugacy

When the prior and the posterior have the same form, we say that the prior is a conjugate prior for the corresponding likelihood. For every distribution in the exponential family, we can derive a conjugate prior on the natural parameters. Let the prior be $p(\eta \mid \tau)$, where $\tau$ denotes the hyperparameters. The posterior can be written as

$$p(\eta \mid X) \propto p(X \mid \eta) \, p(\eta \mid \tau).$$

The exponential family likelihood is

$$p(X \mid \eta) = \left( \prod_{i=1}^{n} h(x_i) \right) \exp\left\{ \eta^T \sum_{i=1}^{n} T(x_i) - n A(\eta) \right\}.$$

For the prior to be conjugate to this likelihood, $p(\eta \mid \tau)$ must itself be in exponential family form with two terms: one term $\tau = \{\tau_1, \ldots, \tau_k\}$ multiplying $\eta$, and another term $\tau_0$ multiplying $A(\eta)$:

$$p(\eta \mid \tau) \propto \exp\left\{ \eta^T \tau - \tau_0 A(\eta) \right\}.$$

Then the posterior can be written as

$$p(\eta \mid X) \propto p(X \mid \eta) \, p(\eta \mid \tau) \propto \exp\left\{ \eta^T \sum_{i=1}^{n} T(x_i) - n A(\eta) \right\} \exp\left\{ \eta^T \tau - \tau_0 A(\eta) \right\}$$
$$= \exp\left\{ \eta^T \left( \sum_{i=1}^{n} T(x_i) + \tau \right) - (n + \tau_0) A(\eta) \right\}.$$

So the posterior has the same exponential family form as the prior, and the posterior hyperparameters are obtained by adding the sum of the sufficient statistics to the hyperparameters of the conjugate prior (and the number of observations $n$ to $\tau_0$). The exponential family is the only family of distributions for which conjugate priors exist; this is a convenient property because conjugate priors simplify computation of the posterior.
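A familiar concrete instance of this update (added here as an illustration, not part of the original notes) is the Beta-Bernoulli pair: the conjugate prior on the Bernoulli mean parameter is the Beta distribution, and the posterior hyperparameters are obtained by adding the counts of ones and zeros, which are the sufficient statistics, to the prior hyperparameters. A minimal Python sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.7, size=50)   # observed coin flips, arbitrary simulated data

# Beta(a, b) prior on the Bernoulli mean parameter pi
a_prior, b_prior = 2.0, 2.0

# Conjugate update: add the sufficient statistics (counts of 1s and 0s) to the hyperparameters
N1, N0 = x.sum(), len(x) - x.sum()
a_post, b_post = a_prior + N1, b_prior + N0

print(a_post, b_post)                # posterior is Beta(a_prior + N1, b_prior + N0)
print(a_post / (a_post + b_post))    # posterior mean of pi, close to the true value 0.7
```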