Exponential Families and Bayesian Inference

Evelyn Morton
5 years ago

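Computer Vision: Exponential Families and Bayesian Inference (Lecture)

Exponential Families

An exponential family of distributions is a d-parameter family $f(x; \theta)$ having the following form:

\[ f(x; \theta) = h(x)\, e^{g(\theta)^T T(x) - B(\theta)}, \tag{3.1} \]

where $\theta = (\theta_1, \ldots, \theta_d) \in \mathbb{R}^d$, $g(\theta) = [g_1(\theta), \ldots, g_d(\theta)]$ for d functions $g_i : \mathbb{R}^d \to \mathbb{R}$, and $T(x) = [T_1(x), \ldots, T_d(x)]$.

Some examples with d = 1

Many well-known distributions belong to this family. Let us look at some examples:

1. Bernoulli Distribution

The Bernoulli distribution characterizes coin tosses:

\[ P(X; p) = p^X (1-p)^{1-X} = e^{X \log p + (1-X)\log(1-p)} = e^{X \log\frac{p}{1-p} + \log(1-p)}. \]

Comparing with equation (3.1): $\theta = p$, $T(x) = x$, $g(p) = \log\frac{p}{1-p}$, $B(p) = -\log(1-p)$, $h(x) = 1$.

2. Binomial Distribution

The binomial distribution characterizes the number of successes (e.g. heads) in n trials (coin tosses), i.e. $X \in \{0, 1, \ldots, n\}$:

\[ P(X = x; p) = \binom{n}{x} p^x (1-p)^{n-x} = \binom{n}{x} e^{x \log\frac{p}{1-p} + n\log(1-p)}. \]

Comparing with Eq 3.1: $\theta = p$, $T(x) = x$, $g(p) = \log\frac{p}{1-p}$, $B(p) = -n\log(1-p)$, $h(x) = \binom{n}{x}$.

3. Poisson Distribution

The Poisson distribution is given by:

\[ f(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!}\, e^{x \log\lambda - \lambda}. \]

Comparing with Eq 3.1: $\theta = \lambda$, $T(x) = x$, $g(\lambda) = \log\lambda$, $B(\lambda) = \lambda$, $h(x) = \frac{1}{x!}$.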
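The three identifications above can be checked numerically. The following is a minimal sketch, assuming only the Python standard library; the helper names (`expfam`, `bernoulli`, `binomial`, `poisson`) are hypothetical and chosen for illustration. Each function plugs the stated h, g, T and B into the form h(x) exp(g T(x) - B), and the printed pairs compare the result with the textbook probability mass function.

```python
import math

# Hypothetical helper names; only the formulas come from the text above.
def expfam(x, h, g, T, B):
    """Evaluate h(x) * exp(g * T(x) - B) for a one-parameter family."""
    return h(x) * math.exp(g * T(x) - B)

def bernoulli(x, p):
    # g(p) = log(p / (1 - p)), B(p) = -log(1 - p), h(x) = 1
    return expfam(x, h=lambda x: 1.0, g=math.log(p / (1 - p)),
                  T=lambda x: x, B=-math.log(1 - p))

def binomial(x, n, p):
    # Same g as Bernoulli, B(p) = -n log(1 - p), h(x) = C(n, x)
    return expfam(x, h=lambda x: math.comb(n, x), g=math.log(p / (1 - p)),
                  T=lambda x: x, B=-n * math.log(1 - p))

def poisson(x, lam):
    # g(lam) = log(lam), B(lam) = lam, h(x) = 1 / x!
    return expfam(x, h=lambda x: 1.0 / math.factorial(x), g=math.log(lam),
                  T=lambda x: x, B=lam)

p, n, lam = 0.3, 5, 1.5
# Each line prints the exponential-family value next to the direct formula.
print(bernoulli(1, p), p)
print(binomial(2, n, p), math.comb(n, 2) * p**2 * (1 - p)**3)
print(poisson(2, lam), lam**2 * math.exp(-lam) / math.factorial(2))
```

The two numbers on each printed line should agree up to floating-point rounding.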

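An example with d > 1

Normal Distribution

The general univariate normal density is given by:

\[ f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2} + \frac{x\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \log\sigma}, \]

which is of the form above, setting $\theta = [\mu \;\; \sigma]^T$, $T(x) = [x^2 \;\; x]^T$, $g(\theta) = \left[-\frac{1}{2\sigma^2} \;\; \frac{\mu}{\sigma^2}\right]^T$, $B(\theta) = \frac{\mu^2}{2\sigma^2} + \log\sigma$ and $h(x) = \frac{1}{\sqrt{2\pi}}$.

Exponential Families are closed under Sampling

If $X_1, \ldots, X_n$ are sampled i.i.d. from an exponential family, the joint density has the form:

\[ f(X_1, \ldots, X_n; \theta) = \left(\prod_{i=1}^n h(X_i)\right) e^{g(\theta)^T \sum_{i=1}^n T(X_i) - nB(\theta)}, \tag{3.2} \]

which also has an exponential form, $h^{(n)}(X_1, \ldots, X_n)\, e^{g(\theta)^T T^{(n)}(X_1, \ldots, X_n) - B^{(n)}(\theta)}$, with $T^{(n)}(X_1, \ldots, X_n) = \sum_i T(X_i)$, $h^{(n)}(X_1, \ldots, X_n) = \prod_i h(X_i)$ and $B^{(n)}(\theta) = nB(\theta)$.

B(θ) and Normalization

Consider the form $h(x)\, e^{g(\theta) T(x)}$. To turn this exponential form into a density, we need to divide by the normalizing constant $\int h(x)\, e^{g(\theta) T(x)}\, dx$. Define:

\[ B(\theta) = \log \int h(x)\, e^{g(\theta) T(x)}\, dx, \]

so that $\int h(x)\, e^{g(\theta) T(x)}\, dx = e^{B(\theta)}$. Now the exponential form becomes a density that integrates to 1:

\[ f(x; \theta) = \frac{h(x)\, e^{g(\theta) T(x)}}{e^{B(\theta)}} = h(x)\, e^{g(\theta) T(x) - B(\theta)}. \]

So $B(\theta)$ is the log of the normalizing constant.

Derivatives of B(θ) and Moments of T

Define:

\[ A(g) = \log \int h(x)\, e^{g\, T(x)}\, dx, \]

so that $B(\theta) = A(g(\theta))$.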
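As a small illustration of B(θ) as a log normalizing constant, the sketch below evaluates the defining sum for the Poisson family (the integral becomes a sum because x is discrete) and compares it with the closed form B(λ) = λ. The function name `log_normalizer_poisson` and the truncation at 100 terms are assumptions made for this example, not part of the lecture.

```python
import math

def log_normalizer_poisson(lam, terms=100):
    """Approximate B(lam) = log sum_x h(x) exp(g(lam) T(x)).

    Uses h(x) = 1/x!, g(lam) = log(lam), T(x) = x, and truncates the
    infinite sum after `terms` terms (an assumption that is adequate
    for moderate lam, since the terms decay factorially).
    """
    g = math.log(lam)
    # lgamma(x + 1) = log(x!), so each term is exp(g*x) / x!
    total = sum(math.exp(g * x - math.lgamma(x + 1)) for x in range(terms))
    return math.log(total)

lam = 3.0
# The numerical log-normalizer should be close to the closed form B(lam) = lam.
print(log_normalizer_poisson(lam), lam)
```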

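Taking the derivative of A with respect to g, we have:

\[ A'(g) = \frac{\int T(x)\, h(x)\, e^{g\, T(x)}\, dx}{\int h(x)\, e^{g\, T(x)}\, dx}. \]

This shows that the derivative of the log normalizing constant gives the expectation of T. One can also verify:

\[ A'(g(\theta)) = E_\theta\, T(X), \tag{3.3} \]
\[ A''(g(\theta)) = \mathrm{Var}_\theta\, T(X). \]

More generally, a connection between the m-th derivative and the m-th moment of $T(X)$ can be established. This is a very useful result, since the problem of estimating moments, which involves computing integrals, has been turned into a problem of differentiating a function.

Maximum Likelihood Estimation

We now use the above properties for maximum likelihood estimation based on i.i.d. samples $X_1, \ldots, X_n$. The joint density is given in Eq 3.2. The log-likelihood function is given by taking logs in Eq 3.2:

\[ l(X_1, \ldots, X_n; \theta) = g(\theta) \sum_{i=1}^n T(X_i) - nB(\theta) + \sum_{i=1}^n \log h(X_i). \]

The MLE estimate is obtained by maximizing the function above:

\[ \hat\theta = \arg\max_\theta\, l(X_1, \ldots, X_n; \theta) \]
\[ = \arg\max_\theta \left[ g(\theta) \sum_{i=1}^n T(X_i) - nB(\theta) + \sum_{i=1}^n \log h(X_i) \right] \]
\[ = \arg\max_\theta \left[ g(\theta) \sum_{i=1}^n T(X_i) - nB(\theta) \right] \]
\[ = \arg\max_\theta \left[ g(\theta) \sum_{i=1}^n T(X_i) - nA(g(\theta)) \right]. \]

Now

\[ \frac{dl(X_1, \ldots, X_n; \theta)}{d\theta} = \frac{dl(X_1, \ldots, X_n; g)}{dg}\, g'(\theta). \]

Setting the derivative with respect to g equal to 0 we get

\[ \sum_{i=1}^n T(x_i) - nA'(g(\theta)) = 0, \]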

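which we rewrite using (3.3) as

\[ \frac{1}{n} \sum_{i=1}^n T(x_i) = E_\theta\, T(X). \]

Thus, the $\theta$ that maximizes the likelihood is the $\theta$ for which the true expectation of $T(X)$ equals the sample expectation. The only way in which the data is involved in the estimation of $\theta$ is via the sample mean $\frac{1}{n}\sum_i T(X_i)$, which is referred to as a sufficient statistic for inference about $\theta$.

Multivariate Exponential Family

The observations above also hold for a multivariate d-parameter exponential family,

\[ f(x; \theta) = h(x)\, e^{g(\theta)^T T(x) - B(\theta)}, \]

with $\theta = [\theta_1 \ldots \theta_d]^T$, $T(X) = [T_1(X) \ldots T_d(X)]$, $g(\theta) = [g_1(\theta) \ldots g_d(\theta)]$. Again defining $A(g) = \log \int h(x)\, e^{g^T T(x)}\, dx$, the following results corresponding to the one-parameter case can be established:

\[ \nabla_g A(g(\theta)) = \left[ E_\theta T_1(X), \ldots, E_\theta T_d(X) \right]^T, \]
\[ \frac{\partial^2 A}{\partial g_i\, \partial g_j}(g(\theta)) = \mathrm{cov}_\theta\left( T_i(X), T_j(X) \right). \]

The maximum likelihood estimate of $\theta$ is made by solving the following set of equations:

\[ \frac{1}{n} \sum_{i=1}^n T_j(x_i) = E_\theta\, T_j, \qquad j = 1, \ldots, d. \]

Defining the discrete empirical distribution which is uniform over the values $X_1, \ldots, X_n$,

\[ R_X = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}, \]

we can express the above equalities as:

\[ E_{R_X} T_j = E_\theta\, T_j. \]

At the ML estimate of $\theta$, the expectations under the empirical distribution equal the true expectations of T.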
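The moment-matching view of maximum likelihood can be tried out on the Bernoulli family, where g = log(p / (1 - p)) and the log-normalizer becomes A(g) = log(1 + e^g). The sketch below is only illustrative: the function names, the finite-difference step sizes, the bisection bounds and the sample data are all assumptions made for this example. It checks that A'(g) and A''(g) reproduce the mean and variance of T(X) = X, then solves A'(g) = sample mean for g.

```python
import math

def A(g):
    """Bernoulli log-normalizer as a function of g = log(p / (1 - p))."""
    return math.log1p(math.exp(g))

def derivative(f, x, eps=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def second_derivative(f, x, eps=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + eps) - 2 * f(x) + f(x - eps)) / eps**2

def mle_natural_parameter(samples, lo=-20.0, hi=20.0, iters=200):
    """Solve A'(g) = mean of T(x_i) = x_i by bisection.

    A' is increasing because A'' is a variance, so bisection applies.
    Assumes the sample mean lies strictly between 0 and 1.
    """
    target = sum(samples) / len(samples)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if derivative(A, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p = 0.3
g0 = math.log(p / (1 - p))
print(derivative(A, g0), p)                   # A'(g) versus E[X] = p
print(second_derivative(A, g0), p * (1 - p))  # A''(g) versus Var[X] = p(1 - p)

samples = [1, 0, 1, 1, 0, 1, 0, 1]            # made-up coin tosses
g_hat = mle_natural_parameter(samples)
p_hat = 1 / (1 + math.exp(-g_hat))            # invert g = log(p / (1 - p))
print(p_hat, sum(samples) / len(samples))     # MLE versus sample mean
```

For this family the equation has the closed-form solution p = s/n, so the bisection is used only to show that the general recipe recovers it.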

The Bayesian Approach

So far we have used the maximum likelihood method for defining estimators for $\theta$, which is thought of as a parameter. The Bayesian approach treats parameters as random variables that can be described by probabilistic statements. Bayesian inference is carried out in the following way:

1. We choose a probability density $P(\theta)$, called the prior distribution, that expresses our prior beliefs about $\theta$ before we see any data.
2. Define the family of conditional distributions $P(X \mid \theta)$. Note that since $\theta$ is now a random variable we write $P(X \mid \theta)$ as opposed to $P(X; \theta)$.
3. After observing data $X_1, \ldots, X_n$, we compute the posterior distribution $P(\theta \mid X_1, \ldots, X_n)$.

For the third step, we employ Bayes' rule:

\[ P(\theta \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid \theta)\, P(\theta)}{P(X_1, \ldots, X_n)} = \frac{P(X_1 \mid \theta)\, P(X_2 \mid \theta) \cdots P(X_n \mid \theta)\, P(\theta)}{P(X_1, \ldots, X_n)}. \]

What can we do with the posterior? Two options are to estimate $\theta$ via the mode or the mean of the posterior distribution. From Bayesian decision theory, these options correspond to optimizing with respect to a zero-one cost or a squared cost respectively. To maximize the posterior:

\[ \hat\theta = \arg\max_\theta P(\theta \mid X_1, \ldots, X_n) = \arg\max_\theta \log P(\theta \mid X_1, \ldots, X_n) = \arg\max_\theta \left[ \sum_{i=1}^n \log P(X_i \mid \theta) + \log P(\theta) \right]. \]

Note that the normalizing term $P(X_1, \ldots, X_n)$ can be ignored. To estimate $\theta$ via the mean of the posterior:

\[ \hat\theta = \int \theta\, P(\theta \mid X_1, \ldots, X_n)\, d\theta. \]

In this case, the normalizing term $P(X_1, \ldots, X_n)$ cannot be ignored.

Conjugate Priors

In Bayesian statistics a prior distribution is multiplied by the likelihood function and then normalized to produce a posterior distribution. A conjugate prior is one which, when combined with the likelihood and normalized, produces a posterior distribution which is

of the same family as the prior. In most cases, once the unnormalized posterior is known, the normalization follows directly from the form of the distribution.

Example

If one is estimating the parameter (the success probability) of a Bernoulli distribution, and if one chooses to use a beta distribution as one's prior, then the posterior is always another beta distribution. This allows us to figure out the normalizing constants, bypassing their actual computation. The Bernoulli distribution is given by:

\[ P(X \mid p) = p^X (1-p)^{1-X}. \]

We put a Beta distribution $B(\alpha, \beta)$ on p:

\[ P(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1} (1-p)^{\beta-1}, \]

where the $\Gamma$ function is a generalization of the factorial to complex and real-valued arguments:

\[ \Gamma(\alpha) = \int_0^\infty y^{\alpha-1} e^{-y}\, dy, \]

which for integers $\alpha = n$ gives the factorial $\Gamma(n) = (n-1)!$. We know that for a Beta distribution, the expectation is given by:

\[ E\, B(\alpha, \beta) = \frac{\alpha}{\alpha+\beta}. \]

Now consider the posterior distribution of p given the i.i.d. sampled data:

\[ P(p \mid X_1, \ldots, X_n) = \frac{P(p) \prod_{i=1}^n P(X_i \mid p)}{P(X_1, \ldots, X_n)} = \frac{C_{\alpha,\beta}\, p^{\alpha-1} (1-p)^{\beta-1} \prod_{i=1}^n p^{X_i} (1-p)^{1-X_i}}{P(X_1, \ldots, X_n)}, \]

where $C_{\alpha,\beta} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$ is the normalizing constant for the $B(\alpha, \beta)$ distribution. The above expression can be written as:

\[ P(p \mid X_1, \ldots, X_n) = \frac{C_{\alpha,\beta}\, p^{\alpha-1} (1-p)^{\beta-1}\, p^{s} (1-p)^{n-s}}{P(X_1, \ldots, X_n)} = C\, p^{s+\alpha-1} (1-p)^{n-s+\beta-1}, \]

where $s = \sum_i X_i$ is the number of successes and C is the normalizing constant for the posterior. Note that from the form of the posterior we already know it is a Beta distribution $B(s+\alpha,\, n-s+\beta)$, and the normalizing constant C is given by

\[ C = \frac{\Gamma(n+\alpha+\beta)}{\Gamma(s+\alpha)\, \Gamma(n-s+\beta)}. \]
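The conjugate update can be sketched in a few lines. The code below is an illustration under stated assumptions: the function names, the prior B(2, 2), the sample data and the midpoint-rule integration are all chosen for this example. It computes the posterior parameters (s + α, n - s + β), evaluates C from the Gamma-function formula, and checks by numerical integration that C does normalize the unnormalized posterior.

```python
import math

def beta_posterior(alpha, beta, samples):
    """Posterior Beta parameters after observing 0/1 samples."""
    s, n = sum(samples), len(samples)
    return alpha + s, beta + n - s

def beta_normalizer(a, b):
    """C = Gamma(a + b) / (Gamma(a) Gamma(b)), computed via log-gamma."""
    return math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def integrate_unit_interval(f, steps=20000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / steps
    return h * sum(f((k + 0.5) * h) for k in range(steps))

samples = [1, 0, 1, 1]                          # made-up data: s = 3, n = 4
a, b = beta_posterior(2.0, 2.0, samples)        # prior B(2, 2)
C = beta_normalizer(a, b)

# Integral of the unnormalized posterior p^(a-1) (1-p)^(b-1) over [0, 1].
mass = integrate_unit_interval(lambda p: p**(a - 1) * (1 - p)**(b - 1))
print(a, b, C, C * mass)                        # C * mass should be close to 1
```

Working with `math.lgamma` rather than `math.gamma` is a design choice that avoids overflow when n is large.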

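The posterior mean estimate $\hat p$, therefore, is:

\[ \hat p = E\, B(s+\alpha,\, n-s+\beta) = \frac{s+\alpha}{n+\alpha+\beta}. \tag{3.4} \]

Recall that the ML estimate $\hat p_{ML}$ was:

\[ \hat p_{ML} = \frac{s}{n}. \]

The posterior estimate in (3.4) and the maximum likelihood estimate are the same asymptotically. However, for small sample sizes (3.4) has a smoothing effect. It disallows zero-probability inferences when the success count is zero, and enforces the influence of a prior estimate. For $\alpha = \beta = 2$, the posterior mean is $\hat p = \frac{s+2}{n+4}$, which is the so-called Wilson's estimate of p.

Conjugate Priors and the Normal Density

Consider observations $X_1, \ldots, X_n$ i.i.d. $N(\mu, \sigma^2)$, where we assume $\sigma$ to be known and $\mu$ to be the only unknown parameter:

\[ P(X_1, \ldots, X_n \mid \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(X_i-\mu)^2}{2\sigma^2}} = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\sum_{i=1}^n \frac{(X_i-\mu)^2}{2\sigma^2} \right\}. \]

Assume a Gaussian prior $N(\mu_0, \sigma_0^2)$ on the mean, i.e. our prior belief is that the mean $\mu$ lies around some value $\mu_0$, normally distributed with variance $\sigma_0^2$:

\[ P(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left\{ -\frac{(\mu-\mu_0)^2}{2\sigma_0^2} \right\}. \]

The posterior has the following form:

\[ P(\mu \mid X_1, \ldots, X_n) = C'\, P(\mu)\, P(X_1, \ldots, X_n \mid \mu) \]
\[ = C'\, \frac{1}{\sqrt{2\pi}\,\sigma_0}\, (2\pi\sigma^2)^{-n/2} \exp\left\{ -\sum_{i=1}^n \frac{(X_i-\mu)^2}{2\sigma^2} - \frac{(\mu-\mu_0)^2}{2\sigma_0^2} \right\} \]
\[ = C'' \exp\left\{ -\frac{1}{2} \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \left( \mu - \frac{\frac{\sum_i X_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} \right)^2 \right\}, \]

where $C'$, $C''$ are appropriate normalization constants. From the last expression it follows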

that the posterior is also normal, with mean $\mu_{post}$ and variance $\sigma_{post}^2$ given by:

\[ \mu_{post} = \frac{\frac{\sum_i X_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} = \frac{\sigma_0^2 \sum_i X_i + \sigma^2 \mu_0}{n\sigma_0^2 + \sigma^2}, \]
\[ \sigma_{post}^2 = \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} = \frac{\sigma^2 \sigma_0^2}{n\sigma_0^2 + \sigma^2}. \]

Recall that the maximum likelihood estimate of the mean was $\mu_{ML} = \frac{1}{n}\sum_i X_i$. The expression for $\mu_{post}$ above can be written as:

\[ \mu_{post} = \frac{\sigma_0^2}{\sigma_0^2 + \frac{\sigma^2}{n}}\, \mu_{ML} + \frac{\frac{\sigma^2}{n}}{\sigma_0^2 + \frac{\sigma^2}{n}}\, \mu_0. \]

Thus in both examples the posterior mean is a weighted average of the sample mean (the maximum likelihood estimate) and the prior mean. Asymptotically, the posterior mean and the sample mean are identical. In the small-sample case, the prior belief can strongly influence the choice of $\mu$ in the manner expressed above.

The posterior mean in the multivariate case has the same form. For estimation of the covariance of a multivariate normal it is also possible to define a conjugate prior: the inverse Wishart distribution on positive definite matrices. We omit the precise form of the distribution. For our purposes it suffices to note that the distribution depends on a central covariance $C_0$ and a concentration parameter a, and prefers covariances close to $C_0$. The final posterior mean again has the form of a weighted average of the empirical covariance matrix and $C_0$:

\[ C_{post} = \frac{n\hat C + aC_0}{n + a}, \]

where $\hat C$ is the empirical covariance, which is the maximum likelihood estimate (see lecture).
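The weighted-average reading of both posterior means can be checked with a short sketch. Everything below is illustrative: the function names, the data and the prior settings are assumptions made for this example. For the normal case it computes the posterior mean and variance from the precision form and compares the mean with the weighted average of the ML estimate and the prior mean; for the Bernoulli case it does the same with the Beta prior mean α / (α + β).

```python
def normal_posterior(samples, sigma2, mu0, sigma0_2):
    """Posterior mean and variance of mu for known variance sigma2."""
    n = len(samples)
    precision = n / sigma2 + 1.0 / sigma0_2          # 1 / posterior variance
    mu_post = (sum(samples) / sigma2 + mu0 / sigma0_2) / precision
    return mu_post, 1.0 / precision

def normal_weighted_average(samples, sigma2, mu0, sigma0_2):
    """Same posterior mean, written as a mix of mu_ML and mu0."""
    n = len(samples)
    mu_ml = sum(samples) / n
    w = sigma0_2 / (sigma0_2 + sigma2 / n)           # weight on the ML estimate
    return w * mu_ml + (1.0 - w) * mu0

def bernoulli_posterior_mean(samples, alpha, beta):
    """Posterior mean (s + alpha) / (n + alpha + beta) under a Beta prior."""
    s, n = sum(samples), len(samples)
    return (s + alpha) / (n + alpha + beta)

def bernoulli_weighted_average(samples, alpha, beta):
    """Same posterior mean, written as a mix of s/n and alpha/(alpha+beta)."""
    s, n = sum(samples), len(samples)
    w = n / (n + alpha + beta)                       # weight on the ML estimate
    return w * (s / n) + (1.0 - w) * (alpha / (alpha + beta))

xs = [1.0, 2.0, 3.0]                                 # made-up observations
print(normal_posterior(xs, 1.0, 0.0, 1.0)[0],
      normal_weighted_average(xs, 1.0, 0.0, 1.0))

tosses = [0, 0, 0]                                   # no successes observed
print(bernoulli_posterior_mean(tosses, 2.0, 2.0),
      bernoulli_weighted_average(tosses, 2.0, 2.0))
```

In the second example the ML estimate is 0, while the posterior mean with α = β = 2 stays strictly positive, which is the smoothing effect described earlier.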
