Exponential Families and Bayesian Inference

Exponential Families

An exponential family of distributions is a d-parameter family f(x; θ) having the following form:

f(x; θ) = h(x) e^{g(θ)^T T(x) - B(θ)},    (1)

where θ = (θ_1, ..., θ_d) ∈ R^d, g(θ) = [g_1(θ), ..., g_d(θ)] for d functions g_i : R^d → R, and T(x) = [T_1(x), ..., T_d(x)].

Some examples with d = 1. Many well-known distributions belong to this family. Let us look at some examples:

1. Bernoulli distribution. The Bernoulli distribution characterizes coin tosses:

P(X; p) = p^X (1-p)^{1-X} = e^{X \log\frac{p}{1-p} + \log(1-p)}.

Comparing with equation (1): θ = p, T(x) = x, g(p) = \log\frac{p}{1-p}, B(p) = -\log(1-p), h(x) = 1.

2. Binomial distribution. The binomial distribution characterizes the number of successes (e.g. heads) in n trials (coin tosses), i.e. X ∈ {0, 1, ..., n}:

P(X; p) = \binom{n}{x} p^X (1-p)^{n-X} = \binom{n}{x} e^{x \log\frac{p}{1-p} + n \log(1-p)}.

Comparing with equation (1): θ = p, T(x) = x, g(p) = \log\frac{p}{1-p}, B(p) = -n\log(1-p), h(x) = \binom{n}{x}.

3. Poisson distribution. The Poisson distribution is given by:

f(x; λ) = \frac{λ^x e^{-λ}}{x!} = \frac{1}{x!} e^{x \log λ - λ}.

Comparing with equation (1): θ = λ, T(x) = x, g(λ) = \log λ, B(λ) = λ, h(x) = \frac{1}{x!}.
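A minimal Python sketch (standard library only; the parameter values 0.3 and 2.5 are arbitrary) that evaluates the Bernoulli and Poisson pmfs both directly and through the exponential-family form (1) with the identifications above:

import math

# Bernoulli: h(x) = 1, T(x) = x, g(p) = log(p/(1-p)), B(p) = -log(1-p)
def bernoulli_direct(x, p):
    return p**x * (1 - p)**(1 - x)

def bernoulli_expfam(x, p):
    g = math.log(p / (1 - p))
    B = -math.log(1 - p)
    return math.exp(g * x - B)                     # h(x) = 1

# Poisson: h(x) = 1/x!, T(x) = x, g(lam) = log(lam), B(lam) = lam
def poisson_direct(x, lam):
    return lam**x * math.exp(-lam) / math.factorial(x)

def poisson_expfam(x, lam):
    return math.exp(x * math.log(lam) - lam) / math.factorial(x)

for x in (0, 1):
    assert abs(bernoulli_direct(x, 0.3) - bernoulli_expfam(x, 0.3)) < 1e-12
for x in range(10):
    assert abs(poisson_direct(x, 2.5) - poisson_expfam(x, 2.5)) < 1e-12
print("exponential-family forms match the direct pmfs")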

An Example with d > 1: the Normal Distribution

The general univariate normal density is given by:

f(x; μ, σ²) = \frac{1}{\sqrt{2π}\,σ} e^{-\frac{(x-μ)^2}{2σ^2}} = \frac{1}{\sqrt{2π}} e^{-\frac{x^2}{2σ^2} + \frac{xμ}{σ^2} - \frac{μ^2}{2σ^2} - \log σ},

which is of the form above, setting θ = [μ, σ²]^T, T(x) = [x, x²]^T, g(θ) = [\frac{μ}{σ^2}, -\frac{1}{2σ^2}]^T, B(θ) = \frac{μ^2}{2σ^2} + \log σ and h(x) = \frac{1}{\sqrt{2π}}.

Exponential Families are Closed under Sampling

If X_1, ..., X_n are sampled i.i.d. from an exponential family, the joint density has the form:

f(X_1, ..., X_n; θ) = \left(\prod_{i=1}^n h(X_i)\right) e^{g(θ)^T \sum_{i=1}^n T(X_i) - nB(θ)},    (2)

which is again of exponential form, h'(X_1, ..., X_n) e^{g(θ)^T T'(X_1, ..., X_n) - B'(θ)}, with T'(X_1, ..., X_n) = \sum_i T(X_i), h'(X_1, ..., X_n) = \prod_i h(X_i) and B'(θ) = nB(θ).

B(θ) and Normalization

Consider the form h(x) e^{g(θ)^T T(x)}. To turn this exponential form into a density, we need to divide by the normalizing constant \int h(x) e^{g(θ)^T T(x)} dx. Define:

B(θ) = \log \int h(x) e^{g(θ)^T T(x)} dx,

so that \int h(x) e^{g(θ)^T T(x)} dx = e^{B(θ)}. Now the exponential form becomes a density that integrates to 1:

f(x; θ) = \frac{h(x) e^{g(θ)^T T(x)}}{e^{B(θ)}} = h(x) e^{g(θ)^T T(x) - B(θ)}.

So B(θ) is the log of the normalizing constant.
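A minimal numerical sketch of this fact for the univariate normal (Python with NumPy; the values μ = 1.3, σ = 0.7 and the integration grid are arbitrary choices): a brute-force integral of h(x) e^{g(θ)^T T(x)} should agree with e^{B(θ)} = σ e^{μ²/(2σ²)}:

import numpy as np

mu, sigma = 1.3, 0.7
g = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])     # g(theta) = [mu/sigma^2, -1/(2 sigma^2)]
B = mu**2 / (2 * sigma**2) + np.log(sigma)               # B(theta) = mu^2/(2 sigma^2) + log sigma

x = np.linspace(-20.0, 20.0, 400001)                     # wide grid; the integrand decays fast
dx = x[1] - x[0]
h = 1.0 / np.sqrt(2 * np.pi)                             # h(x) = 1/sqrt(2 pi)
T = np.stack([x, x**2])                                  # T(x) = [x, x^2]
log_norm = np.log(np.sum(h * np.exp(g @ T)) * dx)        # log of the brute-force integral

print(log_norm, B)                                       # both approximately 1.3678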

Derivatives of B(θ) and Moments of T

Define:

A(g) = \log \int h(x) e^{g^T T(x)} dx,

so that B(θ) = A(g(θ)). Taking the derivative of A with respect to g, we have:

A'(g) = \frac{\int T(x)\, h(x)\, e^{g^T T(x)} dx}{\int h(x)\, e^{g^T T(x)} dx}.

This shows that the derivative of the log normalizing constant gives the expectation of T. One can also verify:

A'(g(θ)) = E_θ[T(X)],    (3)
A''(g(θ)) = Var_θ[T(X)].

More generally, a connection between the m-th derivative of A and the m-th moment of T(X) can be established. This is a very useful result, since the problem of estimating moments, which involves computing integrals, has been turned into a problem of differentiating a function.
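For the Poisson family this can be checked directly: A(g) = \log \sum_x e^{gx}/x! = e^g, so at g = \log λ both A'(g) and A''(g) equal λ, matching E[X] = Var[X] = λ. A minimal finite-difference sketch (standard-library Python; λ = 3 and the truncation of the sum are arbitrary choices):

import math

def A(g, max_x=150):
    # A(g) = log sum_x h(x) e^{g x} with h(x) = 1/x! (Poisson family); equals e^g
    return math.log(sum(math.exp(g * x) / math.factorial(x) for x in range(max_x)))

lam = 3.0
g = math.log(lam)
eps = 1e-4
first = (A(g + eps) - A(g - eps)) / (2 * eps)            # ~ E[T(X)] = lam
second = (A(g + eps) - 2 * A(g) + A(g - eps)) / eps**2   # ~ Var[T(X)] = lam
print(first, second)                                     # both close to 3.0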

Maximum Likelihood Estimation

We now use the above properties for maximum likelihood estimation based on n i.i.d. samples X_1, ..., X_n. The joint density is given in equation (2). The log-likelihood function is obtained by taking logs in (2):

ℓ(X_1, ..., X_n; θ) = g(θ)^T \sum_{i=1}^n T(X_i) - nB(θ) + \sum_{i=1}^n \log h(X_i).

The MLE is obtained by maximizing this function:

\hat{θ} = \arg\max_θ ℓ(X_1, ..., X_n; θ)
        = \arg\max_θ \; g(θ)^T \sum_{i=1}^n T(X_i) - nB(θ)
        = \arg\max_θ \; g(θ)^T \sum_{i=1}^n T(X_i) - nA(g(θ)),

where the term \sum_i \log h(X_i) is dropped because it does not depend on θ. In the one-parameter case, the chain rule gives

\frac{dℓ}{dθ} = \left( \sum_{i=1}^n T(X_i) - n A'(g(θ)) \right) g'(θ).

Setting the derivative equal to 0 we get

\sum_{i=1}^n T(X_i) - n A'(g(θ)) = 0,

which we rewrite using (3) as

\frac{1}{n} \sum_{i=1}^n T(X_i) = E_θ[T(X)].

Thus the θ that maximizes the likelihood is the one for which the true expectation of T(X) equals the sample expectation. The only way in which the data enters the estimation of θ is via the sample mean \frac{1}{n}\sum_i T(X_i), which is referred to as a sufficient statistic for inference about θ.

Multivariate Exponential Family

The observations above also hold for a multivariate d-parameter exponential family,

f(x; θ) = h(x) e^{g(θ)^T T(x) - B(θ)},

with θ = [θ_1, ..., θ_d]^T, T(X) = [T_1(X), ..., T_d(X)]^T and g(θ) = [g_1(θ), ..., g_d(θ)]^T. Again defining A(g) = \log \int h(x) e^{g^T T(x)} dx, the following results corresponding to the one-parameter case can be established:

\frac{\partial A}{\partial g_k}(g(θ)) = E_θ[T_k(X)],
\frac{\partial^2 A}{\partial g_i \partial g_j}(g(θ)) = \mathrm{cov}_θ(T_i(X), T_j(X)).

The maximum likelihood estimate of θ is obtained by solving the following set of equations:

\frac{1}{n} \sum_{i=1}^n T_j(X_i) = E_θ[T_j(X)],    j = 1, ..., d.

Defining the discrete empirical distribution which is uniform over the values X_1, ..., X_n,

R_X = \frac{1}{n} \sum_{i=1}^n δ_{X_i},

we can express the above equalities as:

E_{R_X}[T_j] = E_θ[T_j].

At the ML estimate of θ, the expectation of T under the empirical distribution equals its true expectation.
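A small sketch of this moment-matching characterization for Bernoulli data (Python with NumPy; the success probability 0.3 and the sample size are arbitrary): the condition \frac{1}{n}\sum_i T(X_i) = E_p[T(X)] reads \bar{X} = p, so the MLE is the sample mean, and a brute-force grid maximization of the log-likelihood agrees:

import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.3, size=500)                 # i.i.d. Bernoulli(p = 0.3) samples

# Moment matching: the sample mean of T(x) = x must equal E_p[T(X)] = p
p_hat_moment = X.mean()

# Direct maximization of the Bernoulli log-likelihood over a grid of p values
grid = np.linspace(1e-3, 1 - 1e-3, 9999)
loglik = X.sum() * np.log(grid) + (len(X) - X.sum()) * np.log(1 - grid)
p_hat_grid = grid[np.argmax(loglik)]

print(p_hat_moment, p_hat_grid)                    # agree up to the grid resolution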

The Bayesian Approach

So far we have used the maximum likelihood method to define estimators for θ, which is thought of as a fixed parameter. The Bayesian approach treats parameters as random variables that can be described by probabilistic statements. Bayesian inference is carried out in the following way:

1. We choose a probability density P(θ), called the prior distribution, that expresses our prior beliefs about θ before we see any data.
2. We define the family of conditional distributions P(X | θ). Note that since θ is now a random variable we write P(X | θ) as opposed to P(X; θ).
3. After observing data X_1, ..., X_n, we compute the posterior distribution P(θ | X_1, ..., X_n).

For the third step we employ Bayes' rule:

P(θ | X_1, ..., X_n) = \frac{P(X_1, ..., X_n | θ) P(θ)}{P(X_1, ..., X_n)} = \frac{P(X_1 | θ) P(X_2 | θ) \cdots P(X_n | θ) P(θ)}{P(X_1, ..., X_n)}.

What can we do with the posterior? Two options are to estimate θ via the mode or the mean of the posterior distribution. From Bayesian decision theory, these options correspond to optimizing with respect to a zero-one cost or a squared cost, respectively. To maximize the posterior:

\hat{θ} = \arg\max_θ P(θ | X_1, ..., X_n) = \arg\max_θ \log P(θ | X_1, ..., X_n) = \arg\max_θ \sum_{i=1}^n \log P(X_i | θ) + \log P(θ).

Note that the normalizing term P(X_1, ..., X_n) can be ignored. To estimate via the mean of the posterior:

\hat{θ} = \int θ\, P(θ | X_1, ..., X_n)\, dθ.

In this case the normalizing term P(X_1, ..., X_n) cannot be ignored.
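Both point estimates can be read off a discretized posterior. A minimal sketch (Python with NumPy; the coin-toss data, the uniform prior and the grid resolution are illustrative choices) that evaluates the unnormalized posterior of a Bernoulli parameter on a grid and reports its mode and mean:

import numpy as np

X = np.array([1, 0, 1, 1, 0, 1, 1, 1])              # observed coin tosses (illustrative)
theta = np.linspace(1e-3, 1 - 1e-3, 2001)           # grid over the Bernoulli parameter

log_prior = np.zeros_like(theta)                     # uniform prior on [0, 1]
log_lik = X.sum() * np.log(theta) + (len(X) - X.sum()) * np.log(1 - theta)
log_post = log_prior + log_lik                       # unnormalized log posterior

post = np.exp(log_post - log_post.max())
post /= post.sum()                                   # normalize over the grid

print("posterior mode:", theta[np.argmax(post)])     # MAP estimate (zero-one cost)
print("posterior mean:", np.sum(theta * post))       # posterior mean (squared cost)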

Conjugate Priors

In Bayesian statistics a prior distribution is multiplied by the likelihood function and then normalized to produce a posterior distribution. A conjugate prior is one which, when combined with the likelihood and normalized, produces a posterior distribution of the same family as the prior. In most cases, once the unnormalized posterior is known, the normalization follows directly from the form of the distribution.

Example. If one is estimating the parameter p (the success probability) of a Bernoulli distribution, and one chooses a Beta distribution as one's prior, then the posterior is always another Beta distribution. This allows us to figure out the normalizing constants while bypassing their actual computation. The Bernoulli distribution is given by:

P(X | p) = p^X (1-p)^{1-X}.

We put a Beta distribution Beta(α, β) on p:

P(p) = \frac{Γ(α+β)}{Γ(α)Γ(β)} p^{α-1} (1-p)^{β-1},

where the Γ function is a generalization of the factorial to complex and real-valued arguments:

Γ(α) = \int_0^∞ y^{α-1} e^{-y} dy,

which for an integer α = n gives the factorial Γ(n) = (n-1)!. We know that for a Beta distribution the expectation is given by:

E[Beta(α, β)] = \frac{α}{α+β}.

Now consider the posterior distribution of p given the i.i.d. sampled data:

P(p | X_1, ..., X_n) = \frac{P(p) \prod_{i=1}^n P(X_i | p)}{P(X_1, ..., X_n)} = \frac{C_{α,β}\, p^{α-1} (1-p)^{β-1} \prod_{i=1}^n p^{X_i} (1-p)^{1-X_i}}{P(X_1, ..., X_n)},

where C_{α,β} = \frac{Γ(α+β)}{Γ(α)Γ(β)} is the normalizing constant of the Beta(α, β) distribution. The above expression can be written as:

P(p | X_1, ..., X_n) = \frac{C_{α,β}\, p^{α-1} (1-p)^{β-1}\, p^s (1-p)^{n-s}}{P(X_1, ..., X_n)} = C\, p^{s+α-1} (1-p)^{n-s+β-1},

where s = \sum_{i=1}^n X_i is the number of successes and C is the normalizing constant of the posterior. From the form of the posterior we already know it is a Beta distribution, Beta(s+α, n-s+β), and the normalizing constant C is given by

C = \frac{Γ(n+α+β)}{Γ(s+α)Γ(n-s+β)}.

The posterior mean estimate \hat{p}, therefore, is:

\hat{p} = E[Beta(s+α, n-s+β)] = \frac{s+α}{n+α+β}.    (4)

Recall that the ML estimate was \hat{p}_{ML} = \frac{s}{n}. The posterior estimate (4) and the maximum likelihood estimate are the same asymptotically. However, for small sample sizes (4) has a smoothing effect: it disallows zero-probability inferences when the success count is zero, and it enforces the influence of the prior estimate. For α = β = 2, the posterior mean is \hat{p} = \frac{s+2}{n+4}, which is the so-called Wilson estimate of p.
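A minimal numerical check of the conjugacy calculation (Python with NumPy and SciPy; the data and the Beta(2, 2) prior are illustrative): the grid posterior built from the Beta prior and Bernoulli likelihood should coincide with the closed-form Beta(s+α, n-s+β) density, and its mean should equal (s+α)/(n+α+β):

import numpy as np
from scipy.stats import beta

a0, b0 = 2.0, 2.0                                    # prior Beta(alpha, beta) parameters
X = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])         # illustrative coin tosses
n, s = len(X), int(X.sum())                          # n = 10 tosses, s = 7 successes

p = np.linspace(1e-4, 1 - 1e-4, 5001)
dp = p[1] - p[0]
log_post = (a0 - 1 + s) * np.log(p) + (b0 - 1 + n - s) * np.log(1 - p)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dp                              # normalize as a density on the grid

closed_form = beta.pdf(p, s + a0, n - s + b0)        # Beta(s + alpha, n - s + beta)
print(np.max(np.abs(post - closed_form)))            # ~ 0 up to grid discretization
print(np.sum(p * post) * dp, (s + a0) / (n + a0 + b0))   # posterior means agree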

Conjugate Priors and the Normal Density

Consider observations X_1, ..., X_n i.i.d. N(μ, σ²), where we assume σ² to be known and μ to be the only unknown parameter. The likelihood is:

P(X_1, ..., X_n | μ) = \prod_{i=1}^n \frac{1}{\sqrt{2πσ^2}} e^{-\frac{(X_i-μ)^2}{2σ^2}} = (2πσ^2)^{-n/2} \exp\left\{ -\sum_{i=1}^n \frac{(X_i-μ)^2}{2σ^2} \right\}.

Assume a Gaussian prior N(μ_0, σ_0²) on the mean, i.e. our prior belief is that the mean μ lies around some value μ_0 with variance σ_0², distributed normally:

P(μ) = \frac{1}{\sqrt{2πσ_0^2}} \exp\left\{ -\frac{(μ-μ_0)^2}{2σ_0^2} \right\}.

The posterior has the following form:

P(μ | X_1, ..., X_n) = C\, P(μ)\, P(X_1, ..., X_n | μ)
= C \frac{1}{\sqrt{2πσ_0^2}} (2πσ^2)^{-n/2} \exp\left\{ -\sum_{i=1}^n \frac{(X_i-μ)^2}{2σ^2} - \frac{(μ-μ_0)^2}{2σ_0^2} \right\}
= C' \exp\left\{ -\frac{\left( μ - \frac{\sum_i X_i/σ^2 + μ_0/σ_0^2}{n/σ^2 + 1/σ_0^2} \right)^2}{2\left( n/σ^2 + 1/σ_0^2 \right)^{-1}} \right\},

where C, C' are appropriate normalization constants. From the last expression it follows that the posterior is also normal, with mean μ_{post} and variance σ_{post}^2 given by:

μ_{post} = \frac{\sum_{i=1}^n X_i/σ^2 + μ_0/σ_0^2}{n/σ^2 + 1/σ_0^2} = \frac{σ_0^2 \sum_{i=1}^n X_i + σ^2 μ_0}{n σ_0^2 + σ^2},

σ_{post}^2 = \frac{1}{n/σ^2 + 1/σ_0^2} = \frac{σ^2 σ_0^2}{n σ_0^2 + σ^2}.

Recall that the maximum likelihood estimate of the mean is μ_{ML} = \frac{1}{n}\sum_{i=1}^n X_i. The expression for μ_{post} above can be written as:

μ_{post} = \frac{n σ_0^2}{n σ_0^2 + σ^2} μ_{ML} + \frac{σ^2}{n σ_0^2 + σ^2} μ_0.

Thus, in both examples the posterior mean is a weighted average of the sample mean (the maximum likelihood estimate) and the prior mean. Asymptotically, the posterior mean and the sample mean are identical. In the small-sample case, the prior belief can strongly influence the estimate of μ in the manner expressed above.

The posterior mean in the multivariate case has the same form. For estimation of the covariance of a multivariate normal it is also possible to define a conjugate prior, the inverse Wishart distribution on positive definite matrices. We omit the precise form of the distribution. For our purposes it suffices to note that the distribution depends on a central covariance C_0 and a concentration parameter a, and prefers covariances close to C_0. The final posterior mean again has the form of a weighted average of the empirical covariance matrix and C_0:

C_{post} = \frac{n \hat{C} + a C_0}{n + a},

where \hat{C} is the empirical covariance, which is the maximum likelihood estimate (see the earlier lecture).
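A minimal sketch of this normal-normal update (Python with NumPy; all numerical values are illustrative): it computes μ_{post} and σ_{post}^2 from the formulas above and checks that μ_{post} equals the stated weighted average of μ_{ML} and μ_0:

import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.5                                         # known observation variance sigma^2
mu0, sigma02 = 0.0, 4.0                              # prior mean and variance
X = rng.normal(2.0, np.sqrt(sigma2), size=20)        # i.i.d. observations
n = len(X)

mu_ml = X.mean()
mu_post = (X.sum() / sigma2 + mu0 / sigma02) / (n / sigma2 + 1 / sigma02)
var_post = 1.0 / (n / sigma2 + 1 / sigma02)

w = n * sigma02 / (n * sigma02 + sigma2)             # weight on the sample mean
print(mu_post, w * mu_ml + (1 - w) * mu0)            # identical: weighted-average form
print(var_post)                                      # posterior variance shrinks with n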