5 : Exponential Family and Generalized Linear Models

Size: px

Start display at page:

Download "5 : Exponential Family and Generalized Linear Models"

Sherilyn Copeland
5 years ago
Views:

1 0-708: Probabilistic Graphical Models 0-708, Sprig : Expoetial Family ad Geeralized Liear Models Lecturer: Matthew Gormley Scribes: Yua Li, Yichog Xu, Silu Wag Expoetial Family Probability desity fuctios that are i expoetial family ca be expressed i the followig form. p(x η) h(x)exp{η T T (x) A(η)} A(η) log h(x)exp{η T T (x)}dx Oe example of expoetial family is multiomial distributio. For give data x (x, x 2,..., x k ), x i Multi(, π i ) ad Σ π i, we ca write the probability of the data as follows. p(x π) π x πx2 2...πx k k e log(πx πx πx k k ) e Σk i xilogπi Thus, i the correspodig expoetial form, η [log π, logπ 2,..., logπ k ], x [x, x 2,..., x k ], T(x) x, A(η) 0 ad h(x). Aother example is Dirichlet distributio. Let α, α 2,..., α k > 0. The probability fuctio of such distributio ca be represeted as a expoetial fuctio. p(π α) B(α) ΠK iπ αi i e ΣK i (αi )logπi logb(α) where A(η) logb(α), η [α, α 2,..., α k ], T (x) [logπ, logπ 2,..., π k ] ad h((x)). 2 Cumulat Geeratig Property Notice that oe appealig feature of the expoetial family is that we ca easily compute momets of the distributio by takig derivatives of the log ormalizer A(η).

2 2 5 : Expoetial Family ad Geeralized Liear Models 2. First cumulat a.k.a Mea The first derivative of A(η) is equal to the mea of sufficiet statistics T (X). da dη d dη logz(η) d Z(η) dη Z(η) { } d h(x)exp{η T T (x)}dx Z(η) dη T (x) h(x)exp{ηt T (x)} dx Z(η) T (x)p(x η)dx E[T (X)] 2.2 Secod cumulat a.k.a Variace The secod derivative of A(η) is equal to the variace or first cetral momet of sufficiet statistics T (X). d 2 A dη 2 T (x)exp{η T T (x) A(η)}(T (x) da dη )h(x)dx T 2 (x) h(x)exp{ηt T (x)} dx da T (x) h(x)exp{ηt T (x)} dx Z(η) dη Z(η) T 2 (x)p(x η)dx da T (x)p(x η)dx dη E[T 2 (X)] (E[T (X)]) 2 V ar[t (X)] 2.3 Momet estimatio Accordigly, the q th derivative gives the q th cetered momet. Whe the sufficiet statistic is a stacked vector, partial derivatives eed to be cosidered. 2.4 Momet vs caoical parameters Sice the momet parameter µ ca be derived from the atural (caoical) parameter η by: Also otice that A(η) is covex sice: da(η) η E[T (x)] def µ d 2 A(η) dη 2 V ar[t (x)] > 0

3 5 : Expoetial Family ad Geeralized Liear Models 3 Hece we ca ivert the relatioship ad ifer the caoical parameter from the momet parameter (-to-) by: η ψ(µ) which meas a distributio i the expoetial family ca be parameterized ot oly by η (the caoical parameterizatio), but also by µ (the momet parameterizatio). 3 Sufficiecy For p(x θ), T(x) is sufficiet for θ if there is o iformatio i X regardig θ beyod that i T(x). θ X T (X) However, it is defied i differet ways i the Bayesia ad frequetist frameworks. Bayesia view θ as a radom variable. To estimate θ, T (X) cotais all the essetial iformatio i X. p(θ T (x), x) p(θ T (x)) Frequetist view θ as a label rather tha a radom variable. distributio of X give T (X) is ot a fuctio of θ. T (X) is sufficiet for θ if the coditioal p(x T (x), θ) p(x T (x)) For udirected models, we have p(x, T (x), θ) ψ (T (x), θ)ψ 2 (x, T (x)) Sice T (x) is fuctio of x, we ca drop T (x) o the left side, ad the divide it by p(θ). p(x θ) g(t (x), θ)h(x, T (x)) Aother importat feature of the expoetial family is that oe ca obtai the sufficiet statistics T (X) simply by ispectio. Oce the distributio fuctio is expressed i the stadard form, we ca directly see T (X) is sufficiet for η. p(x η) h(x)exp{η T T (x) A(η)}

4 4 5 : Expoetial Family ad Geeralized Liear Models 4 MLE for Expoetial Family The reductio obtaied by usig a sufficiet statistic T (X) is particularly otable i the case of IID samplig. Suppose the dataset D is composed of N idepedet radom variables, characterized by the same expoetial family desity. For these i.i.d data, the log-likelihood is l(η; D) log N h(x )exp{η T T (x ) A(η)} log(h(x )) + (η T Take derivative ad set it to zero, we ca get l η A(η) η N ˆµ MLE N N T (x ) N A(η) η T (x ) T (x ) ˆη MLE ψ(ˆµ MLE ) T (x )) NA(η) Our formula ivolves the data oly via the sufficiet statistic N T (X ). This meas that to estimate MLE of η, we oly eed to maitai fixed dimesios of data. For Berouilli, Poisso ad multiomial distributios, it suffices to maitai a sigle value, the sum of the observatios. Idividual data poits ca be throw away. While for the uivariate Gaussia distributio, we eed to maitai the sum N x ad the sum of squares N x Examples. Gaussia distributio: We have [ η Σ µ; ] 2 vec(σ ) T (x) [ x; vec(xx T ) ] A(η) 2 µt Σ µ + log Σ 2 h(x) (2π) k/2. So µ MLE T (x ) N N x.

5 5 : Expoetial Family ad Geeralized Liear Models 5 2. Multiomial distributio: We have η [ l π ] k ; 0 π K T (x) [x] A(η) l h(x). ( K k π k ) So µ MLE N x. 3. Poisso distributio: We have η log λ T (x) x A(η) λ e η h(x) x!. So µ MLE N x. 5 Bayesia Estimatio 5. Cojugacy Prior Likelihood suppose η T (η), posterior p(η φ) exp{φ T T (η) A(η)} p(x η) exp{η T T (x) A(η)} p(η x, φ) p(x η)p(η φ) exp{η T T (x) + φ T T (η) A(η) A(φ)} exp{ T (η) T ( T (x) + φ ) (A(η) + A(φ) )} }{{}}{{}}{{} sufficiet fuc atural parameter A(η,φ)

6 6 5 : Expoetial Family ad Geeralized Liear Models 6 Geeralized Liear Model GLIM is a geeralized form of traditioal liear regressio. As i liear regressio, the observed iput x is assumed to eter the model via a liear combiatio of its elemets ξ θ T x. The output of the model, o the other had, is assumed to have a expoetial family distributio with coditioal mea µ f(ξ), where f is kow as the respose fuctio. Note that for liear regressio f is simply the idetity fuctio. Figure is a graphical represetatio of GLIM. Ad Table lists some correspodece betwee usual regressio types ad choice of f ad Y. Figure : Graphical model of GLIM.. Regressio Type f distributio of Y Liear Regressio idetity N (µ, σ 2 ) Logistic Regressio logistic Beroulli Probit regressio cumulative Gaussia Beroulli Multivariate Regressio logistic Multivariate Table : Examples of regressio types ad choice of f ad Y. 6. Why GLIMs? As a geeralizatio of liear regressio, logistic regressio, probit regressio, etc., GLIM provides a framework for creatig ew coditioal distributios that comes with some coveiet properties. Also GLIMs with the caoical respose fuctios are easy to trai with MLE. However, Bayesia estimatio of GLIMs does t have a closed form of posterior, so we have to tur to approximatio techiques. 6.2 Properties of GLIM Formally, we assume the output of GLIM has the followig form: [ ] p(y η, φ) h(y, φ) exp φ (ηt (x)y A(η)).

7 5 : Expoetial Family ad Geeralized Liear Models 7 This is slightly differet from the traditioal defiitio of EF, where we iclude a ew scale parameter φ; most distributios are aturally expressed i this form. Note that η ψ(µ) ad µ f(ξ) f(θ T x), so we have η ψ(f(θ T x)). So the coditioal distributio of y give x, θ ad φ is [ ] p(y x, θφ) h(y, φ) exp φ (yt ψ(f(θ T x)) A(ψ(f(θ T x)))). There re mostly 2 desig choices of GLIM: the choice of expoetial family ad the choice of f. The choice of the expoetial family is largely costraied by the ature of the data y. E.g., for cotiuous y we use multivariate Gaussia, where for discrete class labels we use Beroulli or multiomial. Respose fuctio is usually chose with some mild costraits, e.g., betwee [0, ] ad beig positive. There s a so-called caoical respose fuctio where we use f ψ ; i this case the coditioal probability is simplified to [ ] p(y x, θφ) h(y, φ) exp φ (θt x y A(θ T x)). Figure 2 lists caoical respose fuctio for several distributios. Figure 3 ad table 2 lists the relatioship betwee variables ad caoical fuctios. Figure 2: Caoical respose fuctio for several distributios.. Figure 3: Relatioship betwee variables ad fuctios.. Regressio Type Caoical Respose µ f(ξ) η f (µ) distributio of Y Liear Regressio Y µ ξ η µ N (µ, σ 2 ) Logistic Regressio Y µ +exp( ξ) η log µ µ Beroulli(µ) Probit regressio N µ φ(ξ) η φ (µ) Beroulli(µ) Table 2: Some regressio types ad their respose/lik fuctios.

8 8 5 : Expoetial Family ad Geeralized Liear Models 6.3 MLE estimatio for GLIMs with caoical respose Now we ca compute the MLE estimatio for caoical respose fuctios: the log likelihood fuctio is l log h(y ) + (θ T x y A(η )). Take derivative with respect to θ(ote that θ is the oly parameter for GLIMs with caoical respose): dl dθ ( x y da(η ) ) dη (y µ )x X T (y µ). dη dθ So we ca do stochastic gradiet ascet with update rule where ρ is the step size. θ (t+) θ (t) + ρ(y (θ (t) ) T x )x Aother method is to use Newto-Raphso methods to obtai a batch-learig algorithm: The update rule is θ (t+) θ (t) H θ J where J is the cost fuctio ad H is the Hessia matrix (secod derivative). We have θ J X T (y µ), ad H 2 l θ θ T θ T (y µ )x x µ η η θ T x µ θ T ( dµ where W diag, dµ 2,, dµ N dη dη 2 dη N X T W X, x µ η x T ). So the update rule is θ (t+) θ (t) H θ J (X T W (t) X) X T W (t) z (t) where the adjusted respose is z (t) Xθ (t) + (W (t) ) (y µ (t) ).

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a