Modern Methods of Statistical Learning (sf2935). Auxiliary material: Exponential Family of Distributions. Timo Koski. Second Quarter 2016.


Exponential Families
The family of distributions with densities (w.r.t. a σ-finite measure µ) on $\mathcal{X}$ defined by
$$ f(x \mid \theta) = C(\theta)\, h(x)\, e^{R(\theta) \cdot T(x)} $$
is called an exponential family, where $C(\theta)$ and $h(x)$ are functions from $\Theta$ and $\mathcal{X}$ to $\mathbb{R}_+$, $R(\theta)$ and $T(x)$ are functions from $\Theta$ and $\mathcal{X}$ to $\mathbb{R}^k$, and $R(\theta) \cdot T(x)$ is the scalar product in $\mathbb{R}^k$, i.e.,
$$ R(\theta) \cdot T(x) = \sum_{i=1}^{k} R_i(\theta)\, T_i(x). $$

Exponential Families: Role of µ
The σ-finite measure µ appears as follows:
$$ 1 = C(\theta) \int_{\mathcal{X}} h(x)\, e^{R(\theta)\cdot T(x)}\, d\mu(x), \qquad \text{so} \qquad C(\theta) = \frac{1}{\int_{\mathcal{X}} h(x)\, e^{R(\theta)\cdot T(x)}\, d\mu(x)}. $$
The set
$$ \mathcal{N} := \left\{ \theta : \int_{\mathcal{X}} h(x)\, e^{R(\theta)\cdot T(x)}\, d\mu(x) < \infty \right\} $$
is called the natural parameter space.

EXAMPLES OF EXPONENTIAL FAMILIES: Be(θ)
For $f(x \mid \theta) = \theta^x (1-\theta)^{1-x}$, $x = 0, 1$, we write
$$ f(x \mid \theta) = C(\theta)\, e^{R(\theta)\, x}, $$
where
$$ C(\theta) = e^{\log(1-\theta)} = 1 - \theta, \quad T(x) = x, \quad R(\theta) = \log\frac{\theta}{1-\theta}, \quad h(x) = 1. $$
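Below is a small numerical sketch (my own illustration, not part of the slides) checking the Bernoulli factorization above: the pmf $\theta^x(1-\theta)^{1-x}$ should coincide with $C(\theta)h(x)e^{R(\theta)x}$ for $C(\theta)=1-\theta$, $R(\theta)=\log(\theta/(1-\theta))$, $h(x)=1$.

```python
# A minimal check that theta^x (1 - theta)^(1 - x) matches its
# exponential-family factorization C(theta) * h(x) * exp(R(theta) * x).
import math

def bernoulli_pmf(x, theta):
    return theta**x * (1 - theta)**(1 - x)

def exp_family_form(x, theta):
    C = 1 - theta                              # C(theta) = exp(log(1 - theta))
    R = math.log(theta / (1 - theta))          # natural ("logit") parameter
    h = 1.0                                    # h(x) = 1
    return C * h * math.exp(R * x)

for theta in (0.2, 0.5, 0.9):
    for x in (0, 1):
        assert abs(bernoulli_pmf(x, theta) - exp_family_form(x, theta)) < 1e-12
print("Bernoulli factorization verified.")
```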

EXAMPLES OF EXPONENTIAL FAMILIES: N(µ, σ²)
Let $x^{(n)} = (x_1, x_2, \ldots, x_n)$, $x_i$ i.i.d. $N(\mu, \sigma^2)$. Then
$$ f(x^{(n)} \mid \mu, \sigma^2) = \frac{1}{\sigma^n (2\pi)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2} = \frac{1}{(2\pi)^{n/2}\sigma^n}\, e^{-\frac{n\mu^2}{2\sigma^2}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2 + \frac{\mu}{\sigma^2} n\bar{x}}. $$

EXAMPLES OF EXPONENTIAL FAMILIES: N(µ, σ²)
$$ f(x^{(n)} \mid \mu, \sigma^2) = \frac{1}{(2\pi)^{n/2}\sigma^n}\, e^{-\frac{n\mu^2}{2\sigma^2}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2 + \frac{\mu}{\sigma^2} n\bar{x}}, $$
so with $\theta = (\mu, \sigma^2)$,
$$ C(\theta) = \sigma^{-n} e^{-\frac{n\mu^2}{2\sigma^2}}, \qquad h(x) = \frac{1}{(2\pi)^{n/2}}, $$
$$ R(\theta) \cdot T(x^{(n)}) = R_1(\theta)\, T_1(x^{(n)}) + R_2(\theta)\, T_2(x^{(n)}), $$
$$ T_1(x^{(n)}) = \sum_{i=1}^n x_i^2, \quad T_2(x^{(n)}) = n\bar{x}, \quad R_1(\theta) = -\frac{1}{2\sigma^2}, \quad R_2(\theta) = \frac{\mu}{\sigma^2}. $$
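The same kind of check can be done for the normal sample. The sketch below (my own, assuming NumPy and SciPy are available) compares the joint density computed directly with the exponential-family factorization using $T_1 = \sum x_i^2$, $T_2 = n\bar{x}$.

```python
# Numerical check that the joint N(mu, sigma^2) density of an i.i.d. sample
# equals C(theta) h(x) exp(R1*T1 + R2*T2) with T1 = sum(x_i^2), T2 = n*xbar,
# R1 = -1/(2 sigma^2), R2 = mu/sigma^2.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, n = 1.3, 0.7, 5
x = rng.normal(mu, sigma, size=n)

direct = np.prod(norm.pdf(x, loc=mu, scale=sigma))

T1, T2 = np.sum(x**2), n * np.mean(x)
R1, R2 = -1.0 / (2 * sigma**2), mu / sigma**2
C = sigma**(-n) * np.exp(-n * mu**2 / (2 * sigma**2))
h = (2 * np.pi) ** (-n / 2)
factorized = C * h * np.exp(R1 * T1 + R2 * T2)

assert np.isclose(direct, factorized)
print(direct, factorized)
```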

Exponential Families & Sufficient Statistics (1)
In an exponential family there exists a sufficient statistic of constant dimension (i.e., not depending on n) for any i.i.d. sample $x_1, x_2, \ldots, x_n \sim f(x \mid \theta)$. This means that
$$ f(x_1 \mid \theta)\, f(x_2 \mid \theta) \cdots f(x_n \mid \theta) = C(\theta)^n \prod_{i=1}^n h(x_i)\, e^{R(\theta) \cdot \sum_{i=1}^n T(x_i)}, $$

Exponential Families & Sufficient Statistics (2)
where, in
$$ C(\theta)^n \prod_{i=1}^n h(x_i)\, e^{R(\theta) \cdot \sum_{i=1}^n T(x_i)}, $$
the quantity
$$ \sum_{i=1}^n T(x_i) $$
is a sufficient statistic (explained next).

Sufficient Statistic: A General Definition
For $x \sim f(x \mid \theta)$, a function $T$ of $x$ is called a sufficient statistic (for θ) if the distribution of $x$ conditional on $T(x)$ does not depend on θ:
$$ f(x \mid T, \theta) = f(x \mid T). $$
Bayesian definition: $\pi(\theta \mid x) = \pi(\theta \mid T(x))$.

Sufficient Statistic: Example
Let $x^{(n)} = (x_1, x_2, \ldots, x_n)$, $x_i$ i.i.d. Be(θ). Then
$$ f(x^{(n)} \mid \theta) = C(\theta)^n \prod_{i=1}^n e^{R(\theta)\, x_i} = C(\theta)^n\, e^{R(\theta) \sum_{i=1}^n t(x_i)} = \theta^t (1-\theta)^{n-t}, $$
where $t = \sum_{i=1}^n t(x_i)$ with $t(x_i) = x_i$. We know that
$$ \sum_{i=1}^n t(x_i) \sim \mathrm{Bin}(n, \theta). $$

Sufficient Statistic: Example (cont'd)
Since $t = \sum_{i=1}^n t(x_i)$ is determined by $x^{(n)}$,
$$ f(x^{(n)} \mid \theta, t) = \frac{f(x^{(n)}, t \mid \theta)}{f(t \mid \theta)} = \frac{\theta^t (1-\theta)^{n-t}}{\binom{n}{t} \theta^t (1-\theta)^{n-t}} = \frac{1}{\binom{n}{t}}, $$
which does not depend on θ; hence $\sum_{i=1}^n t(x_i)$ is sufficient.

Natural Exponential Families (1)
If
$$ f(x \mid \theta) = C(\theta)\, h(x)\, e^{\theta \cdot x}, $$
where $\Theta \subseteq \mathbb{R}^k$ and $\mathcal{X} \subseteq \mathbb{R}^k$, the family is said to be natural. Here $\theta \cdot x$ is the inner product on $\mathbb{R}^k$.

Natural Exponential Families (2)
Equivalently,
$$ f(x \mid \theta) = h(x)\, e^{\theta \cdot x - \psi(\theta)}, $$
where $\psi(\theta) = -\log C(\theta)$.

Natural Exponential Families: Poisson Distribution
$$ f(x \mid \lambda) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots $$
$$ = \frac{1}{x!}\, e^{\theta x - e^{\theta}}, $$
with
$$ \psi(\theta) = e^{\theta}, \quad \theta = \log \lambda, \quad h(x) = \frac{1}{x!}. $$

Mean in a Natural Exponential Family
If $E_\theta[x]$ denotes the mean (vector) of $x \sim f(x \mid \theta)$ in a natural family, then
$$ E_\theta[x] = \int_{\mathcal{X}} x\, f(x \mid \theta)\, dx = \nabla_\theta \psi(\theta), $$
where $\theta \in \operatorname{int}(\mathcal{N})$ and $\mathcal{X} \subseteq \mathbb{R}^k$.
Proof:
$$ \int_{\mathcal{X}} x\, f(x \mid \theta)\, dx = e^{-\psi(\theta)} \int_{\mathcal{X}} h(x)\, x\, e^{\theta \cdot x}\, dx. $$

Mean in a Natural Exponential Family
$$ e^{-\psi(\theta)} \int_{\mathcal{X}} h(x)\, x\, e^{\theta \cdot x}\, dx = e^{-\psi(\theta)} \int_{\mathcal{X}} h(x)\, \nabla_\theta e^{\theta \cdot x}\, dx = e^{-\psi(\theta)}\, \nabla_\theta \int_{\mathcal{X}} h(x)\, e^{\theta \cdot x}\, dx $$
$$ = e^{-\psi(\theta)}\, \nabla_\theta \frac{1}{C(\theta)} = e^{-\psi(\theta)} \frac{-\nabla_\theta C(\theta)}{C(\theta)^2} = C(\theta) \frac{-\nabla_\theta C(\theta)}{C(\theta)^2} = \frac{-\nabla_\theta C(\theta)}{C(\theta)} = \nabla_\theta\left(-\log C(\theta)\right) = \nabla_\theta \psi(\theta). $$

Mean in a Natural Exponential Family: Poisson Distribution
$$ f(x \mid \lambda) = \frac{1}{x!}\, e^{\theta x - e^{\theta}}, \qquad \psi(\theta) = e^{\theta}, $$
$$ E_\theta[x] = \frac{d}{d\theta} \psi(\theta) = e^{\theta} = \lambda. $$
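As a quick sanity check of my own (not from the slides), one can compare a numerical derivative of $\psi(\theta) = e^\theta$ at $\theta = \log\lambda$ with a Monte Carlo estimate of the Poisson mean.

```python
# The mean of a Poisson variable should equal the derivative of psi at
# theta = log(lambda).
import numpy as np

lam = 3.5
theta = np.log(lam)
psi = np.exp                      # psi(theta) = e^theta

eps = 1e-6
dpsi = (psi(theta + eps) - psi(theta - eps)) / (2 * eps)   # central difference

rng = np.random.default_rng(1)
mc_mean = rng.poisson(lam, size=200_000).mean()            # Monte Carlo E[x]

print(f"dpsi/dtheta = {dpsi:.4f}, Monte Carlo mean = {mc_mean:.4f}, lambda = {lam}")
```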

Conjugate Priors for Exponential Families: An Intuitive Example
Let $x^{(n)} = (x_1, x_2, \ldots, x_n)$, $x_i \sim \mathrm{Po}(\lambda)$, i.i.d. Then
$$ f(x^{(n)} \mid \lambda) = \frac{e^{-n\lambda}\, \lambda^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_i!}. $$
The likelihood is
$$ l(\lambda \mid x^{(n)}) \propto e^{-n\lambda}\, \lambda^{\sum_{i=1}^n x_i}. $$
This suggests taking the conjugate density to be a Gamma density, which is of the form
$$ \pi(\lambda) \propto e^{-\beta\lambda}\, \lambda^{\alpha - 1}, $$
and hence
$$ \pi(\lambda \mid x^{(n)}) \propto e^{-\lambda(\beta + n)}\, \lambda^{\sum_{i=1}^n x_i + \alpha - 1}. $$
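A minimal sketch of the resulting Gamma update (my own code; the hyperparameters α and β below are illustrative choices, not from the slides):

```python
# Poisson-Gamma conjugate update matching the proportionality above:
# prior Gamma(alpha, rate=beta) on lambda, posterior Gamma(alpha + sum(x), rate=beta + n).
import numpy as np

def poisson_gamma_update(x, alpha, beta):
    """Return (alpha_post, beta_post) for a Gamma(alpha, rate=beta) prior."""
    x = np.asarray(x)
    return alpha + x.sum(), beta + len(x)

rng = np.random.default_rng(2)
true_lambda = 4.0
x = rng.poisson(true_lambda, size=50)

alpha0, beta0 = 2.0, 1.0                      # illustrative prior hyperparameters
alpha1, beta1 = poisson_gamma_update(x, alpha0, beta0)

print(f"posterior: Gamma({alpha1:.1f}, rate={beta1:.1f}), "
      f"posterior mean = {alpha1 / beta1:.3f} (true lambda = {true_lambda})")
```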

Conjugate Family of Priors for Exponential Families
Consider the natural exponential family $f(x \mid \theta) = h(x)\, e^{\theta \cdot x - \psi(\theta)}$. Then the conjugate family is given by
$$ \pi(\theta \mid \mu, \lambda) = K(\mu, \lambda)\, e^{\theta \cdot \mu - \lambda\psi(\theta)}, $$
and the posterior is $\pi(\theta \mid \mu + x, \lambda + 1)$.

Conjugate Priors for Exponential Families: Proof
Proof: By Bayes' rule,
$$ \pi(\theta \mid x) = \frac{f(x \mid \theta)\, \pi(\theta \mid \mu, \lambda)}{m(x)}. $$
We have
$$ f(x \mid \theta)\, \pi(\theta \mid \mu, \lambda) = h(x)\, e^{\theta \cdot x - \psi(\theta)}\, K(\mu, \lambda)\, e^{\theta \cdot \mu - \lambda\psi(\theta)} = h(x)\, K(\mu, \lambda)\, e^{\theta \cdot (x + \mu) - (1+\lambda)\psi(\theta)}. $$

Conjugate Priors for Exponential Families: Proof
$$ m(x) = \int_\Theta f(x \mid \theta)\, \pi(\theta \mid \mu, \lambda)\, d\theta = h(x)\, K(\mu, \lambda) \int_\Theta e^{\theta \cdot (x + \mu) - (1+\lambda)\psi(\theta)}\, d\theta = h(x)\, K(\mu, \lambda)\, K(x + \mu, \lambda + 1)^{-1}, $$
since $\pi(\cdot \mid x + \mu, \lambda + 1)$ is a density on Θ.

Conjugate Priors for Exponential Families: Proof
$$ \pi(\theta \mid x) = \frac{h(x)\, K(\mu, \lambda)\, e^{\theta \cdot (x + \mu) - (1+\lambda)\psi(\theta)}}{h(x)\, K(\mu, \lambda)\, K(x + \mu, \lambda + 1)^{-1}} = K(x + \mu, \lambda + 1)\, e^{\theta \cdot (x + \mu) - (1+\lambda)\psi(\theta)}, $$
which shows that the posterior belongs to the same family as the prior and that $\pi(\theta \mid x) = \pi(\theta \mid \mu + x, \lambda + 1)$, as claimed.

Conjugate Priors for Exponential Families
The proof requires that
$$ \pi(\theta \mid \mu, \lambda) = K(\mu, \lambda)\, e^{\theta \cdot \mu - \lambda\psi(\theta)} $$
is a probability density on Θ. The conditions for this are given in Exercise 3.35.

A Mean for Exponential Families
We have the following property: if
$$ \pi(\theta) = K(x_o, \lambda)\, e^{\theta \cdot x_o - \lambda\psi(\theta)}, $$
then, writing $\xi(\theta) = E_\theta[x]$,
$$ \int_\Theta \xi(\theta)\, \pi(\theta)\, d\theta = \frac{x_o}{\lambda}. $$
This has been proved by Diaconis and Ylvisaker (1979). The proof is not summarized here.

Posterior Means with Conjugate Priors for Exponential Families
If
$$ \pi(\theta) = K(\mu, \lambda)\, e^{\theta \cdot \mu - \lambda\psi(\theta)}, $$
then
$$ \int_\Theta E_\theta[x]\, \pi(\theta \mid x^{(n)})\, d\theta = \frac{\mu + n\bar{x}}{\lambda + n}. $$
This follows from the preceding result, as shown by Diaconis and Ylvisaker (1979).

Mean of a Predictive Distribution
$$ \int_\Theta E_\theta[x]\, \pi(\theta \mid x^{(n)})\, d\theta = \int_\Theta \int_{\mathcal{X}} x\, f(x \mid \theta)\, dx\; \pi(\theta \mid x^{(n)})\, d\theta $$
(by Fubini's theorem)
$$ = \int_{\mathcal{X}} x \int_\Theta f(x \mid \theta)\, \pi(\theta \mid x^{(n)})\, d\theta\, dx $$
(by the definition in lecture 9 of sf3935)
$$ = \int_{\mathcal{X}} x\, g(x \mid x^{(n)})\, dx, $$
the mean of the predictive distribution.

Mean of a Predictive Distribution
Hence, if conjugate priors for exponential families are used, then
$$ \int_{\mathcal{X}} x\, g(x \mid x^{(n)})\, dx = \frac{\mu + n\bar{x}}{\lambda + n} $$
is the mean of the corresponding predictive distribution. This suggests interpreting µ and λ as virtual observations.
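For the Poisson example, the natural conjugate prior corresponds (after the change of variables $\theta = \log\lambda$) to a Gamma(shape = µ, rate = λ) prior on the rate, so the predictive mean should come out as $(\mu + n\bar{x})/(\lambda + n)$. The sketch below (my own, with made-up hyperparameters) checks this by simulation.

```python
# Check that the predictive mean equals (mu + n*xbar) / (lambda + n) for the
# Poisson model with its natural conjugate (Gamma) prior.
import numpy as np

rng = np.random.default_rng(3)
mu, lam = 6.0, 2.0                       # prior "virtual" sum and sample size
x = rng.poisson(3.0, size=40)            # observed data
n, xbar = len(x), x.mean()

formula = (mu + n * xbar) / (lam + n)

# Monte Carlo: draw the rate from its Gamma posterior, then a new observation.
rates = rng.gamma(shape=mu + x.sum(), scale=1.0 / (lam + n), size=200_000)
x_new = rng.poisson(rates)
print(f"formula = {formula:.4f}, Monte Carlo predictive mean = {x_new.mean():.4f}")
```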

Laplace's Prior
P. S. Laplace formulated the principle of insufficient reason for choosing a prior: use a uniform prior. There are drawbacks to this. Consider Laplace's prior for $\theta \in [0, 1]$,
$$ \pi(\theta) = \begin{cases} 1 & 0 \le \theta \le 1, \\ 0 & \text{elsewhere,} \end{cases} $$
and then consider $\varphi = \theta^2$.
(Footnote: http://www-groups.dcs.st-and.ac.uk/~history/mathematicians/laplace.html)

Laplace's Prior
We find the density of $\varphi = \theta^2$. Take $0 < v < 1$:
$$ F_\varphi(v) = P(\varphi \le v) = P(\theta \le \sqrt{v}) = \int_0^{\sqrt{v}} \pi(\theta)\, d\theta = \sqrt{v}, $$
$$ f_\varphi(v) = \frac{d}{dv} F_\varphi(v) = \frac{d}{dv} \sqrt{v} = \frac{1}{2\sqrt{v}}, $$
which is no longer uniform. But why should we have a non-uniform prior density for $\theta^2$ when there is full ignorance about θ?
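A short simulation (my own illustration) makes the point concrete: with θ uniform on [0, 1], the distribution of $\varphi = \theta^2$ follows the cdf $\sqrt{v}$ rather than the uniform one.

```python
# If theta is uniform on [0, 1], then phi = theta^2 has cdf sqrt(v) and
# density 1 / (2 sqrt(v)), i.e. it is not uniform.
import numpy as np

rng = np.random.default_rng(4)
theta = rng.uniform(0.0, 1.0, size=500_000)
phi = theta**2

for v in (0.1, 0.25, 0.5, 0.9):
    empirical = np.mean(phi <= v)
    print(f"P(phi <= {v}) ~ {empirical:.4f}  vs  sqrt(v) = {np.sqrt(v):.4f}")
```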

Jeffreys Prior: the idea
We want to use a method (M) for choosing a prior density with the following property: if $\psi = g(\theta)$, with $g$ a monotone map, then the density of ψ given by the method (M) is
$$ \pi_\Psi(\psi) = \pi\!\left(g^{-1}(\psi)\right) \left| \frac{d}{d\psi} g^{-1}(\psi) \right|, $$
which is the standard probability-calculus rule for a change of variable in a probability density.

Jeffreys Prior
We shall now describe one such method (M), namely the Jeffreys prior. In order to introduce the Jeffreys prior we first need to define the Fisher information, which will be needed also for purposes other than the choice of prior.

Fisher Information of x
For a parametric model $x \sim f(x \mid \theta)$, where $f(x \mid \theta)$ is differentiable w.r.t. $\theta \in \mathbb{R}$, we define $I(\theta)$, the Fisher information of x, as
$$ I(\theta) = \int_{\mathcal{X}} \left( \frac{\partial \log f(x \mid \theta)}{\partial \theta} \right)^2 f(x \mid \theta)\, d\mu(x). $$
Conditions for the existence of $I(\theta)$ are given in Schervish (1995), p. 111.

Fisher Information of x: An Example
$$ I(\theta) = E_\theta\!\left[ \left( \frac{\partial \log f(X \mid \theta)}{\partial \theta} \right)^2 \right]. $$
Example: σ is known,
$$ f(x \mid \theta) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \theta)^2 / 2\sigma^2}, \qquad \frac{\partial \log f(x \mid \theta)}{\partial \theta} = \frac{x - \theta}{\sigma^2}, $$
$$ I(\theta) = E\!\left[ \frac{(x - \theta)^2}{\sigma^4} \right] = \frac{\sigma^2}{\sigma^4} = \frac{1}{\sigma^2}. $$
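The sketch below (my own, not from the slides) estimates $I(\theta)$ for this example as the Monte Carlo mean of the squared score and compares it with $1/\sigma^2$.

```python
# For N(theta, sigma^2) with sigma known, the Fisher information of one
# observation is 1 / sigma^2.
import numpy as np

rng = np.random.default_rng(5)
theta, sigma = 2.0, 1.5
x = rng.normal(theta, sigma, size=500_000)

score = (x - theta) / sigma**2           # d/dtheta log f(x | theta)
I_mc = np.mean(score**2)                 # E[(score)^2]

print(f"Monte Carlo I(theta) = {I_mc:.4f}, 1/sigma^2 = {1 / sigma**2:.4f}")
```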

Fisher Information of x, θ ∈ R^k
For $x \sim f(x \mid \theta)$, where $f(x \mid \theta)$ is differentiable w.r.t. $\theta \in \mathbb{R}^k$, we define $I(\theta)$, the Fisher information of x, as the matrix $I(\theta) = (I_{ij}(\theta))_{i,j=1}^{k,k}$ with
$$ I_{ij}(\theta) = \mathrm{Cov}_\theta\!\left( \frac{\partial \log f(x \mid \theta)}{\partial \theta_i},\ \frac{\partial \log f(x \mid \theta)}{\partial \theta_j} \right). $$

Fisher Information of x^(n)
For the same parametric model with $x_i \sim f(x \mid \theta)$ i.i.d. and $x^{(n)} = (x_1, x_2, \ldots, x_n)$,
$$ f(x^{(n)} \mid \theta) = f(x_1 \mid \theta)\, f(x_2 \mid \theta) \cdots f(x_n \mid \theta). $$
The Fisher information of $x^{(n)}$ is
$$ I_{x^{(n)}}(\theta) = \int \left( \frac{\partial \log f(x^{(n)} \mid \theta)}{\partial \theta} \right)^2 f(x^{(n)} \mid \theta)\, d\mu(x^{(n)}) = n\, I(\theta). $$

Fisher Information of x: another form
Consider a parametric model $x \sim f(x \mid \theta)$, where $f(x \mid \theta)$ is twice differentiable w.r.t. $\theta \in \mathbb{R}$. If we can write
$$ \frac{d}{d\theta} \int_{\mathcal{X}} \left( \frac{\partial \log f(x \mid \theta)}{\partial \theta} \right) f(x \mid \theta)\, d\mu(x) = \int_{\mathcal{X}} \frac{\partial}{\partial \theta}\!\left[ \left( \frac{\partial \log f(x \mid \theta)}{\partial \theta} \right) f(x \mid \theta) \right] d\mu(x), $$
then
$$ I(\theta) = -\int_{\mathcal{X}} \left( \frac{\partial^2 \log f(x \mid \theta)}{\partial \theta^2} \right) f(x \mid \theta)\, d\mu(x). $$

Fisher Information of x, θ ∈ R^k
For $x \sim f(x \mid \theta)$, where $f(x \mid \theta)$ is twice differentiable w.r.t. $\theta \in \mathbb{R}^k$, under some conditions
$$ I(\theta) = \left[ -E_\theta\!\left( \frac{\partial^2 \log f(x \mid \theta)}{\partial \theta_i\, \partial \theta_j} \right) \right]_{i,j=1}^{k,k}. $$

Fisher Information of x: Natural Exponential Family
For a natural exponential family $f(x \mid \theta) = h(x)\, e^{\theta \cdot x - \psi(\theta)}$,
$$ -\frac{\partial^2 \log f(x \mid \theta)}{\partial \theta_i\, \partial \theta_j} = \frac{\partial^2 \psi(\theta)}{\partial \theta_i\, \partial \theta_j}, $$
so no expectation needs to be computed to obtain $I(\theta)$.
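For the Poisson example this says $I(\theta) = \psi''(\theta) = e^\theta$, which also equals $\mathrm{Var}(x) = \lambda$. A small numerical check of my own:

```python
# In the natural parametrization of the Poisson family, I(theta) = psi''(theta)
# = exp(theta) = lambda = Var(x); no expectation over x is needed.
import numpy as np

lam = 2.5
theta = np.log(lam)

eps = 1e-4
psi = np.exp
psi_second = (psi(theta + eps) - 2 * psi(theta) + psi(theta - eps)) / eps**2

rng = np.random.default_rng(6)
var_mc = rng.poisson(lam, size=500_000).var()

print(f"psi''(theta) = {psi_second:.4f}, Monte Carlo Var(x) = {var_mc:.4f}, lambda = {lam}")
```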

Jeffreys Prior defined
$$ \pi(\theta) := \frac{\sqrt{I(\theta)}}{\int_\Theta \sqrt{I(\theta)}\, d\theta}, $$
assuming that the normalizing integral in the denominator exists. Otherwise the prior is improper.

Jeffreys Prior is a method (M)
Let $\psi = g(\theta)$, with $g$ a monotone map, and let $\pi(\theta)$ be the Jeffreys prior. Let us compute the prior density $\pi_\Psi(\psi)$ for ψ:
$$ \pi_\Psi(\psi) = \pi\!\left(g^{-1}(\psi)\right) \left| \frac{d}{d\psi} g^{-1}(\psi) \right| \propto \sqrt{E_\theta\!\left[ \left( \frac{\partial \log f(X \mid \theta)}{\partial \theta} \right)^2 \right]}\ \left| \frac{d}{d\psi} g^{-1}(\psi) \right| \qquad (\theta = g^{-1}(\psi)) $$
$$ = \sqrt{E_{g^{-1}(\psi)}\!\left[ \left( \frac{\partial \log f\!\left(X \mid g^{-1}(\psi)\right)}{\partial \theta}\, \frac{d}{d\psi} g^{-1}(\psi) \right)^2 \right]} = \sqrt{E_{g^{-1}(\psi)}\!\left[ \left( \frac{\partial \log f\!\left(X \mid g^{-1}(\psi)\right)}{\partial \psi} \right)^2 \right]} = \sqrt{I(\psi)}. $$
Hence the prior for ψ is the Jeffreys prior.
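As a concrete illustration of my own (using the Be(θ) family from earlier in these notes; this calculation is not worked out on the slides): the Bernoulli score gives $I(\theta) = 1/(\theta(1-\theta))$, so the Jeffreys prior is proportional to $\theta^{-1/2}(1-\theta)^{-1/2}$, a Beta(1/2, 1/2) density.

```python
# Jeffreys prior for the Bernoulli parameter: sqrt(I(theta)) normalized over
# (0, 1) equals the Beta(1/2, 1/2) density.
import numpy as np
from scipy.stats import beta
from scipy.special import beta as beta_fn

def fisher_info_bernoulli(theta):
    # E[(d/dtheta log f)^2], computed exactly over x in {0, 1}
    score0 = -1.0 / (1.0 - theta)      # score at x = 0
    score1 = 1.0 / theta               # score at x = 1
    return (1.0 - theta) * score0**2 + theta * score1**2

Z = beta_fn(0.5, 0.5)                  # normalizer of theta^(-1/2)(1-theta)^(-1/2), equals pi

for t in (0.1, 0.3, 0.5, 0.8):
    jeffreys = np.sqrt(fisher_info_bernoulli(t)) / Z
    print(f"theta={t}: Jeffreys {jeffreys:.4f}  vs  Beta(1/2,1/2) pdf {beta.pdf(t, 0.5, 0.5):.4f}")
```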

Maximum Entropy Priors
We are going to discuss maximum entropy prior densities. We need a new definition: Kullback's information measure.

Kullback's Information Measure
Let $f(x)$ and $g(x)$ be two densities. Kullback's information measure $I(f; g)$ is defined as
$$ I(f; g) := \int_{\mathcal{X}} f(x) \log \frac{f(x)}{g(x)}\, d\mu(x). $$
We interpret $\log \frac{f(x)}{0} = \infty$ and $0 \log 0 = 0$. It can be shown that $I(f; g) \ge 0$. Kullback's information measure does not require the same kind of conditions for existence as the Fisher information.

Kullback's Information Measure: Two Normal Distributions
Let $f(x)$ and $g(x)$ be the densities of $N(\theta_1, \sigma^2)$ and $N(\theta_2, \sigma^2)$, respectively. Then
$$ \log \frac{f(x)}{g(x)} = \frac{1}{2\sigma^2}\left[ (x - \theta_2)^2 - (x - \theta_1)^2 \right], $$
$$ I(f; g) = \frac{1}{2\sigma^2}\, E_{\theta_1}\!\left[ (x - \theta_2)^2 - (x - \theta_1)^2 \right] = \frac{1}{2\sigma^2}\left[ E_{\theta_1}(x - \theta_2)^2 - \sigma^2 \right]. $$

Kullback's Information Measure: Two Normal Distributions
We have
$$ E_{\theta_1}(x - \theta_2)^2 = E_{\theta_1}(x^2) - 2\theta_2 E_{\theta_1}(x) + \theta_2^2 = \sigma^2 + \theta_1^2 - 2\theta_2\theta_1 + \theta_2^2 = \sigma^2 + (\theta_1 - \theta_2)^2. $$
Then
$$ I(f; g) = \frac{1}{2\sigma^2}\left[ \sigma^2 + (\theta_1 - \theta_2)^2 - \sigma^2 \right] = \frac{1}{2\sigma^2}(\theta_1 - \theta_2)^2. $$
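A quick Monte Carlo check of the closed form (my own sketch, assuming NumPy and SciPy):

```python
# KL(N(theta1, sigma^2) || N(theta2, sigma^2)) = (theta1 - theta2)^2 / (2 sigma^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
theta1, theta2, sigma = 1.0, 2.5, 0.8
x = rng.normal(theta1, sigma, size=500_000)

kl_mc = np.mean(norm.logpdf(x, theta1, sigma) - norm.logpdf(x, theta2, sigma))
kl_exact = (theta1 - theta2) ** 2 / (2 * sigma**2)
print(f"Monte Carlo KL = {kl_mc:.4f}, closed form = {kl_exact:.4f}")
```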

Kullback's Information Measure: Natural exponential densities
Let $f_i(x) = h(x)\, e^{\theta_i \cdot x - \psi(\theta_i)}$, $i = 1, 2$. Then
$$ I(f_1; f_2) = (\theta_1 - \theta_2) \cdot \nabla_\theta \psi(\theta_1) - \left( \psi(\theta_1) - \psi(\theta_2) \right). $$

Kullback's Information Measure for Prior Densities
Let $\pi(\theta)$ and $\pi_o(\theta)$ be two densities on Θ:
$$ I(\pi; \pi_o) = \int_\Theta \pi(\theta) \log \frac{\pi(\theta)}{\pi_o(\theta)}\, d\nu(\theta). $$
Here ν is another σ-finite measure.

Maximum Entropy Prior (1)
Find the prior $\pi(\theta)$ that maximizes the entropy
$$ -I(\pi; \pi_o) = -\int_\Theta \pi(\theta) \log \frac{\pi(\theta)}{\pi_o(\theta)}\, d\nu(\theta) $$
under constraints (on moments, quantiles, etc.) of the form
$$ E_\pi[g_k(\theta)] = \omega_k. $$
The method is due to E. T. Jaynes; see, e.g., his Probability Theory: The Logic of Science.

Maximum Entropy Prior (1)
With the same setup as above (constraints $E_\pi[g_k(\theta)] = \omega_k$), perhaps Winkler's experiments could be redone like this: the assessor gives several $\omega_k$, and the maximum entropy π is then derived.

Maximum Entropy Prior (2)
This gives, by the calculus of variations,
$$ \pi(\theta) = \frac{e^{\sum_k \lambda_k g_k(\theta)}\, \pi_o(\theta)}{\int_\Theta e^{\sum_k \lambda_k g_k(\eta)}\, \pi_o(d\eta)}, $$
where the $\lambda_k$ are Lagrange multipliers determined by the constraints.
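As a hedged illustration of my own (the numbers are made up): on $\Theta = [0, 1]$ with a uniform base measure $\pi_o$ and a single mean constraint $E_\pi[\theta] = \omega$, the maximum entropy prior is proportional to $e^{\lambda\theta}$, and the multiplier λ can be found numerically.

```python
# Maximum entropy prior on [0, 1] with uniform base measure and mean constraint
# E[theta] = omega: the solution is proportional to exp(lambda * theta).
import numpy as np
from scipy.optimize import brentq

def tilted_mean(lam):
    """Mean of the density proportional to exp(lam * theta) on [0, 1]."""
    if abs(lam) < 1e-8:
        return 0.5                              # uniform limit
    return np.exp(lam) / (np.exp(lam) - 1.0) - 1.0 / lam

omega = 0.7                                     # target prior mean (assumed)
lam = brentq(lambda l: tilted_mean(l) - omega, -50.0, 50.0)

theta = np.linspace(0.0, 1.0, 5)
Z = (np.exp(lam) - 1.0) / lam                   # normalizer of exp(lam * theta)
print(f"lambda = {lam:.4f}")
print("pi(theta) at", theta, "=", np.exp(lam * theta) / Z)
```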