Various types of likelihood

1. likelihood, marginal likelihood, conditional likelihood, profile likelihood, adjusted profile likelihood
2. semi-parametric likelihood, partial likelihood
3. empirical likelihood, penalized likelihood
4. quasi-likelihood, composite likelihood
5. simulated likelihood, indirect inference
6. bootstrap likelihood, h-likelihood, weighted likelihood, pseudo-likelihood, local likelihood, sieve likelihood
November 6

- HW 2 comments & K-L divergence
- presentations
- semi-parametric likelihood as profile
- empirical likelihood
- composite likelihood
Exercises October 23, STA 4508S (Fall 2018)

1. The Kullback–Leibler divergence from the distribution $G$ to the distribution $F$ is given by
$$ KL(F : G) = \int \log\left\{\frac{f(y)}{g(y)}\right\} f(y)\, dy, $$
where $f$ and $g$ are density functions with respect to Lebesgue measure. Note that the divergence is not symmetric in its arguments. This is called the directed information distance in Barndorff-Nielsen and Cox (1994), where the more general definition $KL(F : G) = \int \log(dF/dG)\, dF$ is used, assuming $F$ and $G$ are mutually absolutely continuous.

(a) In the canonical exponential family model with density $f(s; \varphi) = \exp\{\varphi^T s - k(\varphi)\}\, h(s)$, $s \in \mathbb{R}^p$, find an expression for the KL divergence between the model with parameter $\varphi_1$ and that with parameter $\varphi_2$.

(b) Show that for a sample of observations from a model with density $f(y; \theta)$, the maximum likelihood estimator minimizes the KL divergence from $F(\cdot\,; \theta)$ to $G_n(\cdot)$, where $G_n(\cdot)$ is the empirical distribution function putting mass $1/n$ at each observation $y_i$.

2. Suppose $y_i \sim N(\mu_i, 1/n)$, $i = 1, \ldots, k$, and $\psi^2 = \sum_{i=1}^k \mu_i^2$ is the parameter of interest. (It will be convenient to use $\lambda_i = \mu_i / (\sum \mu_j^2)^{1/2}$ for the nuisance parameters.)

(a) Show that the marginal posterior density for $n\psi^2$, assuming a flat prior $\pi(\mu) \propto 1$, is a non-central $\chi^2_k$ distribution with non-centrality parameter $n \sum y_i^2$.

(b) Show that the maximum likelihood estimate of $\psi^2$ is $\hat\psi^2 = \sum y_i^2$, and that $n\hat\psi^2$ has a non-central $\chi^2_k$ distribution with non-centrality parameter $n\psi^2$.

(c) Compare the normal approximations to $r_u(\psi)$, $r_e(\psi)$ and $r(\psi)$ with the exact distribution of the maximum likelihood estimate.

(d) Compare the 95% Bayesian posterior probability interval for $\psi^2$, based on (a), to the 95% confidence interval for $\psi^2$, based on (b).
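For exercise 1(a), one compact route to the answer, using the standard exponential-family identity $E_\varphi(s) = k'(\varphi)$ (not stated in the exercise, but standard):
$$ KL(F_{\varphi_1} : F_{\varphi_2}) = \int \log\frac{f(s; \varphi_1)}{f(s; \varphi_2)}\, f(s; \varphi_1)\, ds = \int \{(\varphi_1 - \varphi_2)^T s - k(\varphi_1) + k(\varphi_2)\}\, f(s; \varphi_1)\, ds = (\varphi_1 - \varphi_2)^T k'(\varphi_1) - k(\varphi_1) + k(\varphi_2), $$
since the $h(s)$ factors cancel inside the logarithm.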
Proportional hazards regression

partial log-likelihood function, for ordered failure times $y_1 < y_2 < \cdots < y_n$:
$$ \ell_{\text{part}}(\beta; y, d) = \sum_{i=1}^n d_i \Big\{ x_i^T \beta - \log \sum_{j \in R_i} \exp(x_j^T \beta) \Big\} $$
where $R_i = \{j : y_j \ge y_i\}$ is the set of individuals that could be observed to fail at time $y_i$; see SM 10.8 for treatment of ties

can be motivated as:
1. the marginal log-likelihood of the ranks of the failure times
2. $\sum_{i=1}^n \log \Pr(\text{unit } i \text{ fails at } y_i \mid R_i, \text{ there is one failure at } y_i)$
3. the profile log-likelihood function if $\lambda(\cdot)$ is represented by a vector of values $(\lambda_1, \ldots, \lambda_n) = \{\lambda(y_1), \ldots, \lambda(y_n)\}$
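A minimal numerical sketch of this partial log-likelihood (assuming no tied failure times; the function and argument names are illustrative, not from the course materials):

```python
import numpy as np

def cox_partial_loglik(beta, y, d, X):
    """Partial log-likelihood for proportional hazards, assuming no ties.

    y: (n,) observed times; d: (n,) failure indicators (1 = failure);
    X: (n, p) covariates; beta: (p,) coefficients.
    """
    eta = X @ beta
    ll = 0.0
    for i in range(len(y)):
        if d[i] == 1:
            risk = y >= y[i]          # risk set R_i = {j : y_j >= y_i}
            ll += eta[i] - np.log(np.sum(np.exp(eta[risk])))
    return ll
```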
Inference

$$ \ell_p(\tilde\theta_n) = \ell_p(\theta_0) + (\tilde\theta_n - \theta_0)^T \sum_{j=1}^n \tilde U_j(\theta_0) - \tfrac{1}{2}\, n\, (\tilde\theta_n - \theta_0)^T \tilde\imath(\theta_0)(\tilde\theta_n - \theta_0) + o_p(\sqrt{n}\,\|\tilde\theta_n - \theta_0\| + 1)^2 $$

- $\tilde U$ is the efficient score: the component of $\partial \ell / \partial \theta$ orthogonal to the space spanned by the nuisance scores
- as in parametric models, this leads to $\sqrt{n}\,(\hat\theta - \theta_0) \stackrel{.}{\sim} N\{0, \tilde\imath^{-1}(\theta_0)\}$ and the likelihood ratio test $2\{\ell_p(\hat\theta) - \ell_p(\theta_0)\} \stackrel{.}{\sim} \chi^2_d$
- the proof uses least favourable sub-models through the true model, which effectively turn the infinite-dimensional parameter into a finite-dimensional one
Infinite-dimensional models

- recall that $L(\theta; y) \propto f(y; \theta)$, with $f(y; \theta)$ a density w.r.t. a dominating measure
- more abstract definition: if a probability measure $Q$ is absolutely continuous w.r.t. a probability measure $P$, and both possess densities w.r.t. a measure $\mu$, then the likelihood of $Q$ w.r.t. $P$ is the Radon–Nikodym derivative $dQ/dP = q/p$, a.e. $P$
- some semi-parametric models have a dominating measure and a family of densities; some can be handled by the notion of empirical likelihood; some may use mixtures of these
... infinite-dimensional models

Definition: given a measure $P$ and a sample $(y_1, \ldots, y_n)$, the empirical likelihood function is
$$ EL(P; y) = \prod_{i=1}^n P(\{y_i\}), $$
where $P(\{y\})$ is the measure of the one-point set $\{y\}$

Definition: given a model $\mathcal{P}$, a maximum likelihood estimator is the distribution $\hat P$ that maximizes the empirical likelihood over $\mathcal{P}$; it may or may not exist
Example: the empirical distribution (vdW 25.68)

- $\mathcal{P}$ is the set of all probability distributions on a measurable space $(\mathcal{Y}, \mathcal{A})$ in which one-point sets are measurable
- suppose the observed values $y_1, \ldots, y_n$ are distinct; then $\{(P(\{y_1\}), \ldots, P(\{y_n\})) : P \in \mathcal{P}\}$ is the set of vectors $(p_1, \ldots, p_n)$ with $p_i \ge 0$, $\sum p_i = 1$
- the empirical likelihood is maximized at $(\tfrac{1}{n}, \ldots, \tfrac{1}{n})$; the empirical distribution function is the nonparametric MLE, $F_n(y) = n^{-1} \sum 1(Y_i \le y)$
- $EL$ is not the same as $\prod f(y_i)$, even if $P$ has a density $f$
Compare (Owen, Ch. 2)

- for $y \in \mathbb{R}$, define $F(y) = \Pr(Y \le y)$ and $F(y-) = \Pr(Y < y)$
- for $y_1, \ldots, y_n$ the nonparametric likelihood function is
$$ L(F) = \prod_{i=1}^n \{F(y_i) - F(y_i-)\}, $$
hence $L(F) = 0$ if $F$ is continuous
- Theorem 2.1 of Owen: $L(F) < L(F_n)$ for any $F \ne F_n$, where $F_n(y) = \frac{1}{n} \sum 1\{y_i \le y\}$
- so there is a likelihood function on the space of distribution functions for which the empirical c.d.f. is the maximum likelihood estimator
- why does this fail for densities?
Aside: empirical likelihood (Owen, Ch. 2)

profile version of empirical likelihood:
$$ R(\theta) = \sup\left\{ \frac{L(F)}{L(F_n)} : F \in \mathcal{F},\ T(F) = \theta \right\} $$
$R$ is a relative likelihood, hence the factors $np_i$ below

example: $T(F) = \int x\, dF(x)$,
$$ R(\theta) = \max\left\{ \prod_{i=1}^n np_i : \sum_{i=1}^n p_i y_i = \theta,\ p_i \ge 0,\ \sum p_i = 1 \right\} $$

for $y_1, \ldots, y_n$ i.i.d. $F_0$ with $E(y_i) = \theta_0$ and $\mathrm{var}(y_i) < \infty$ (Theorem 2.2, Owen):
$$ -2\log R(\theta_0) \stackrel{d}{\longrightarrow} \chi^2_1, \qquad \hat p_i = \frac{1}{n}\,\frac{1}{1 + \alpha(y_i - \theta_0)}, \qquad \frac{1}{n}\sum_{i=1}^n \frac{y_i - \theta_0}{1 + \alpha(y_i - \theta_0)} = 0 $$
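A small computational sketch of this construction for the mean: solve the estimating equation above for the Lagrange multiplier $\alpha$ by root-finding, then assemble $-2\log R(\theta_0)$. This is one standard way to implement Owen's result; the function name and example data are illustrative.

```python
import numpy as np
from scipy.optimize import brentq

def el_log_ratio(y, theta0):
    """-2 log R(theta0) for the mean, via the Lagrange multiplier alpha.

    Assumes theta0 lies strictly inside the convex hull of the data,
    so the estimating equation has a root."""
    z = y - theta0
    n = len(z)
    g = lambda a: np.mean(z / (1.0 + a * z))   # (1/n) sum z_i / (1 + a z_i)
    eps = 1e-10
    # alpha must keep 1 + alpha z_i > 0 for all i
    alpha = brentq(g, -1.0 / z.max() + eps, -1.0 / z.min() - eps)
    p = 1.0 / (n * (1.0 + alpha * z))          # hat p_i
    return -2.0 * np.sum(np.log(n * p))

rng = np.random.default_rng(0)
y = rng.exponential(size=50)
print(el_log_ratio(y, theta0=1.0))             # approximately chi^2_1 under the truth
```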
Semi-parametric logistic regression (vdW Ex. 25.71)

$$ \Pr(Y = 1 \mid V, W) = \frac{e^{\theta V + \eta(W)}}{1 + e^{\theta V + \eta(W)}} $$

sample $(Y_i, V_i, W_i)$, $i = 1, \ldots, n$, independent:
$$ L(\theta, \eta; Y) = \prod_{i=1}^n \left\{ \frac{e^{\theta v_i + \eta(w_i)}}{1 + e^{\theta v_i + \eta(w_i)}} \right\}^{y_i} \left\{ \frac{1}{1 + e^{\theta v_i + \eta(w_i)}} \right\}^{1 - y_i} $$

taking $\eta(w_i) \to \infty$ when $y_i = 1$ and $\eta(w_i) \to -\infty$ when $y_i = 0$ drives $L(\theta, \eta)$ to its trivial upper bound, so we can't maximize it directly

suggestion: penalized log-likelihood
$$ \log L(\theta, \eta; Y) - \hat\alpha_n^2 \int \{\eta^{(k)}(w)\}^2\, dw $$
which needs a separate analysis of its properties
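A minimal sketch of the penalized fit, replacing the derivative penalty $\int \{\eta^{(k)}\}^2$ with a discrete second-difference penalty on the values of $\eta$ at the sorted $w_i$ (a simplifying assumption made here; the function name and the fixed smoothing parameter `lam` are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def fit_penalized_logistic(y, v, w, lam=1.0):
    """Maximize log L(theta, eta) - lam * sum(second differences of eta)^2,
    with eta represented by its values at the sorted w_i."""
    order = np.argsort(w)
    y, v = np.asarray(y)[order], np.asarray(v)[order]
    n = len(y)

    def objective(par):                        # negative penalized log-likelihood
        theta, eta = par[0], par[1:]
        lin = theta * v + eta
        loglik = np.sum(y * lin - np.logaddexp(0.0, lin))
        penalty = lam * np.sum(np.diff(eta, n=2) ** 2)
        return -(loglik - penalty)

    res = minimize(objective, x0=np.zeros(n + 1), method="L-BFGS-B")
    return res.x[0], res.x[1:]                 # theta_hat, eta_hat at sorted w_i
```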
Example: missing covariate (Murphy and vdW, 2000)

- observation $(D, W, Z)$; $D$ and $W$ are independent, given $Z$
- $\Pr(D = 0) = \{1 + \exp(\gamma + \beta e^z)\}^{-1}$
- $W \sim N(\alpha_0 + \alpha_1 z, \sigma^2)$
- $Z \sim g(\cdot)$, non-parametric
- $Z$ is the gold-standard covariate, e.g. LDL cholesterol; $W$ is a surrogate for $Z$
- $(d_C, w_C, z_C)$ is a complete observation; $(d_R, w_R)$ has a missing covariate

$$ f(x; \theta, g) = f(d_C, w_C \mid z_C; \theta)\, g(z_C) \int f(d_R, w_R \mid z; \theta)\, g(z)\, dz, \qquad x = (d_C, w_C, z_C, d_R, w_R) $$
$$ EL(\theta, g) = f(d_C, w_C \mid z_C; \theta)\, g\{z_C\} \int f(d_R, w_R \mid z; \theta)\, g(z)\, dz $$

- $\theta = (\gamma, \beta, \alpha_0, \alpha_1, \sigma^2)$; $g$ can be profiled out, according to M&vdW
Composite likelihood (Lindsay, 1988)

- vector observation: $Y \sim f(y; \theta)$, $Y \in \mathcal{Y} \subset \mathbb{R}^m$, $\theta \in \mathbb{R}^d$
- set of events: $\{A_k,\ k \in K\}$
- composite log-likelihood:
$$ c\ell(\theta; y) = \sum_{k \in K} w_k\, \ell_k(\theta; y), \qquad \ell_k(\theta; y) = \log f(\{y \in A_k\}; \theta) $$
where $\ell_k(\theta; y)$ is the log-likelihood for an event and $\{w_k,\ k \in K\}$ is a set of weights
- also called: pseudo-likelihood (spatial modelling), quasi-likelihood (econometrics), limited information method (psychometrics)
Examples of composite log-likelihood

- Independence: $\sum_{r=1}^m w_r \log f_1(y_r; \theta)$
- Pairwise: $\sum_{r=1}^m \sum_{s>r} w_{rs} \log f_2(y_r, y_s; \theta)$
- Conditional: $\sum_{r=1}^m w_r \log f(y_r \mid y_{(-r)}; \theta)$
- All pairs conditional: $\sum_{r=1}^m \sum_{s>r} w_{rs} \log f(y_r \mid y_s; \theta)$
- Time series: $\sum_{r=1}^m w_r \log f(y_r \mid y_{r-1}; \theta)$
- Spatial: $\sum_{r=1}^m w_r \log f(y_r \mid \text{neighbours of } y_r; \theta)$

also: small blocks of observations, pairwise differences, ..., your favourite combination; a pairwise version is sketched in code below
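As an illustration of the pairwise version, a sketch for an $m$-variate normal model, whose bivariate marginals are available in closed form (the function name and the uniform default weights are illustrative):

```python
import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal

def pairwise_cl_normal(mean, cov, Y, weights=None):
    """Pairwise composite log-likelihood sum_{r<s} w_rs log f2(y_r, y_s; theta),
    using the bivariate normal marginals of an m-variate normal model.

    Y: (n, m) data; mean: (m,) array; cov: (m, m) array; weights: optional (m, m)."""
    m = Y.shape[1]
    cl = 0.0
    for r, s in combinations(range(m), 2):
        w = 1.0 if weights is None else weights[r, s]
        sub = np.ix_([r, s], [r, s])          # 2x2 marginal covariance block
        cl += w * np.sum(multivariate_normal.logpdf(Y[:, [r, s]], mean[[r, s]], cov[sub]))
    return cl
```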
Derived quantities

- single response $y$ with density $f(y; \theta)$, $y \in \mathbb{R}^m$, $\theta \in \mathbb{R}^d$
- composite log-likelihood $c\ell(\theta; y) = \log CL(\theta; y) = \sum_k w_k \ell_k(\theta; y)$
- composite score function $U_{CL}(\theta) = \partial\, c\ell(\theta; y)/\partial\theta$
- sensitivity $H(\theta) = E_\theta\{-\partial^2 c\ell(\theta; y)/\partial\theta\, \partial\theta^T\}$
- variability $J(\theta) = E_\theta\{U_{CL}(\theta)\, U_{CL}(\theta)^T\}$
- Godambe information $G(\theta) = H(\theta) J^{-1}(\theta) H(\theta)$
... derived quantities

- sample $y = (y_1, \ldots, y_n)$ with joint density $f(y; \theta)$, each $y_i \in \mathbb{R}^m$, $\theta \in \mathbb{R}^d$
- score function $U_{CL}(\theta) = \nabla_\theta\, c\ell(\theta; y) = \sum_{i=1}^n \nabla_\theta\, c\ell(\theta; y_i)$
- maximum composite likelihood estimate $\hat\theta_{CL} = \hat\theta_{CL}(y) = \arg\sup_\theta c\ell(\theta; y)$
- score equation $U_{CL}(\hat\theta_{CL}) = c\ell'(\hat\theta_{CL}) = 0$
- composite LRT $w_{CL}(\theta) = 2\{c\ell(\hat\theta_{CL}) - c\ell(\theta)\}$
- Godambe information $G(\theta) = G_n(\theta) = H_n(\theta) J_n^{-1}(\theta) H_n(\theta) = O(n)$
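In practice $H_n$ and $J_n$ are often estimated empirically from the per-observation score contributions; a sketch under that assumption (the user-supplied helper `score_i` and the function name are hypothetical):

```python
import numpy as np

def godambe_info(score_i, theta_hat, Y, eps=1e-5):
    """Empirical sensitivity H_n, variability J_n and Godambe information
    G_n = H_n J_n^{-1} H_n at theta_hat.

    score_i(theta, y) should return the (d,) composite score contribution
    of a single observation y; Y is a list of observations."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    d = len(theta_hat)
    U = np.array([score_i(theta_hat, y) for y in Y])    # (n, d) contributions
    J = U.T @ U                                         # sum of outer products
    H = np.zeros((d, d))
    for j in range(d):
        step = np.zeros(d)
        step[j] = eps
        U_plus = np.sum([score_i(theta_hat + step, y) for y in Y], axis=0)
        U_minus = np.sum([score_i(theta_hat - step, y) for y in Y], axis=0)
        H[:, j] = -(U_plus - U_minus) / (2 * eps)       # -dU/dtheta_j, central differences
    G = H @ np.linalg.solve(J, H)                       # H J^{-1} H
    return H, J, G
```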
Inference

sample $Y_1, \ldots, Y_n$ i.i.d., $c\ell(\theta; y) = \sum_{i=1}^n c\ell(\theta; y_i)$:
$$ \sqrt{n}\,(\hat\theta_{CL} - \theta) \stackrel{.}{\sim} N\{0, G^{-1}(\theta)\}, \qquad G(\theta) = H(\theta) J^{-1}(\theta) H(\theta) $$

sketch of the argument, writing $U = U_{CL}$:
$$ U(\hat\theta_{CL}) \doteq U(\theta) + (\hat\theta_{CL} - \theta)^T\, \nabla_\theta U(\theta) $$
$$ \hat\theta_{CL} - \theta \doteq \{-\nabla_\theta U(\theta)\}^{-1} U(\theta) \doteq H^{-1}(\theta)\, U(\theta) $$
$$ U(\theta) \stackrel{.}{\sim} N\{0, J(\theta)\} \quad\Longrightarrow\quad H^{-1}(\theta)\, U(\theta) \stackrel{.}{\sim} N\{0, H^{-1}(\theta) J(\theta) H^{-T}(\theta)\} $$
conclude $\sqrt{n}\,(\hat\theta_{CL} - \theta) \stackrel{.}{\sim} N\{0, G^{-1}(\theta)\}$
... inference

$$ w(\theta) = 2\{c\ell(\hat\theta_{CL}) - c\ell(\theta)\} \stackrel{d}{\longrightarrow} \sum_{a=1}^d \mu_a Z_a^2, \qquad Z_a \sim N(0, 1) \text{ independently} $$
$\mu_1, \ldots, \mu_d$ the eigenvalues of $J(\theta) H(\theta)^{-1}$

$$ c\ell(\hat\theta_{CL}) - c\ell(\theta) \doteq \tfrac{1}{2}\, (\hat\theta_{CL} - \theta)^T \{-c\ell''(\hat\theta_{CL})\}\, (\hat\theta_{CL} - \theta) $$
the limit is a weighted combination of $\chi^2_1$ variables, not $\chi^2_d$ in general

- $J(\theta) = \mathrm{var}\, U(\theta)$, $H(\theta) = E_\theta\{-\partial U(\theta)/\partial\theta\}$
- if $J(\theta) = H(\theta)$, then $w(\theta) \stackrel{.}{\sim} \chi^2_d$
- if $d = 1$, $w(\theta) \stackrel{.}{\sim} \mu_1 \chi^2_1 = J(\theta) H^{-1}(\theta)\, \chi^2_1$, with $H$, $J$ both scalars
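To use $w(\theta)$ for testing, the null distribution can be simulated directly from the eigenvalues; a short sketch (the function name is illustrative):

```python
import numpy as np

def wcl_null_quantile(J, H, q=0.95, nsim=100_000, seed=0):
    """Approximate the null q-quantile of w(theta) ~ sum_a mu_a Z_a^2,
    where mu_1, ..., mu_d are the eigenvalues of J H^{-1}."""
    mu = np.linalg.eigvals(J @ np.linalg.inv(H)).real   # real for valid J, H
    rng = np.random.default_rng(seed)
    Z2 = rng.standard_normal((nsim, len(mu))) ** 2      # squared N(0,1) draws
    return np.quantile(Z2 @ mu, q)
```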
Example: symmetric normal

$Y_i \sim N_m(0, R)$, $\mathrm{var}(Y_{ir}) = 1$, $\mathrm{corr}(Y_{ir}, Y_{is}) = \rho$

compound bivariate normal densities to form the pairwise log-likelihood:
$$ c\ell(\rho; y_1, \ldots, y_n) = -\frac{nm(m-1)}{4}\log(1-\rho^2) - \frac{m-1+\rho}{2(1-\rho^2)}\, SS_w - \frac{(m-1)(1-\rho)}{2(1-\rho^2)}\, SS_b $$
$$ SS_w = \sum_{i=1}^n \sum_{s=1}^m (y_{is} - \bar y_{i.})^2, \qquad SS_b = m \sum_{i=1}^n \bar y_{i.}^2 $$

full log-likelihood:
$$ \ell(\rho; y_1, \ldots, y_n) = -\frac{n(m-1)}{2}\log(1-\rho) - \frac{n}{2}\log\{1 + (m-1)\rho\} - \frac{1}{2(1-\rho)}\, SS_w - \frac{1}{2\{1 + (m-1)\rho\}}\, SS_b $$
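Both functions depend on the data only through $SS_w$ and $SS_b$, so they are cheap to compare numerically; a sketch (function names illustrative):

```python
import numpy as np

def ss_decomposition(Y):
    """Within/between sums of squares for an (n, m) sample."""
    ybar = Y.mean(axis=1)
    SSw = np.sum((Y - ybar[:, None]) ** 2)
    SSb = Y.shape[1] * np.sum(ybar ** 2)
    return SSw, SSb

def symmetric_pairwise_cl(rho, Y):
    """Pairwise composite log-likelihood cl(rho) for the symmetric normal."""
    n, m = Y.shape
    SSw, SSb = ss_decomposition(Y)
    return (-n * m * (m - 1) / 4 * np.log(1 - rho**2)
            - (m - 1 + rho) / (2 * (1 - rho**2)) * SSw
            - (m - 1) * (1 - rho) / (2 * (1 - rho**2)) * SSb)

def symmetric_full_loglik(rho, Y):
    """Full log-likelihood l(rho) for the symmetric normal."""
    n, m = Y.shape
    SSw, SSb = ss_decomposition(Y)
    return (-n * (m - 1) / 2 * np.log(1 - rho)
            - n / 2 * np.log(1 + (m - 1) * rho)
            - SSw / (2 * (1 - rho))
            - SSb / (2 * (1 + (m - 1) * rho)))
```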
... symmetric normal

$$ \text{a.var}(\hat\rho) = \frac{2\,\{1 + (m-1)\rho\}^2 (1-\rho)^2}{nm(m-1)\,\{1 + (m-1)\rho^2\}} $$

$$ \text{a.var}(\hat\rho_{CL}) = \frac{2\,(1-\rho)^2}{nm(m-1)}\, \frac{c(m, \rho)}{(1+\rho^2)^2} $$

$$ c(m, \rho) = (1-\rho)^2(3\rho^2 + 1) + m\rho(-3\rho^3 + 8\rho^2 - 3\rho + 2) + m^2\rho^2(1-\rho)^2 $$

both are $O(1/n)$ as $n \to \infty$; as $m \to \infty$, $\text{a.var}(\hat\rho_{CL})$ remains $O(1/n)$ while $\text{a.var}(\hat\rho)$ is $O\{1/(nm)\}$
... symmetric normal

[Figure: asymptotic efficiency $\text{a.var}(\hat\rho)/\text{a.var}(\hat\rho_{CL})$ plotted against $\rho \in [0, 1]$ for $m = 3, 5, 8, 10$; the efficiency ranges from about 0.85 to 1.00 (Cox & Reid, 2004)]
Likelihood ratio test

[Figure: four panels of log-likelihood curves plotted against $\rho$ (x-axis 0 to 0.8), for ($\rho = 0.5$, $n = 10$, $q = 5$), ($\rho = 0.8$, $n = 10$, $q = 5$), ($\rho = 0.2$, $n = 10$, $q = 5$) and ($\rho = 0.2$, $n = 7$, $q = 5$)]
Example: longitudinal count data (Henderson & Shimakura, 2003)

- subjects $i = 1, \ldots, n$, with observed counts $y_{ir}$, $r = 1, \ldots, m_i$
- model: $y_{ir} \sim \text{Poisson}\{u_{ir} \exp(x_{ir}^T \beta)\}$
- $u_{i1}, \ldots, u_{im_i}$ gamma-distributed random effects, but correlated: $\mathrm{corr}(u_{ir}, u_{is}) = \rho^{|r-s|}$
- the joint density has a combinatorial number of terms in $m_i$: impractical
- weighted pairwise composite likelihood:
$$ L_{\text{pair}}(\beta) = \prod_{i=1}^n \prod_{r=1}^{m_i} \prod_{s=r+1}^{m_i} f(y_{ir}, y_{is}; \beta)^{1/(m_i - 1)} $$
- weights chosen so that $L_{\text{pair}}$ equals the full likelihood when $\rho = 0$
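A structural sketch of this weighted pairwise log-likelihood. For simplicity it evaluates the pair density only in the $\rho = 0$ case, where the Poisson–gamma pair factors into two negative binomial marginals (the gamma effects are assumed here to have mean 1 and shape $\nu$; all names are illustrative, and a real implementation would replace `pair_logpdf` with the correlated-gamma pair density):

```python
import numpy as np
from itertools import combinations
from scipy.stats import nbinom

def pair_logpdf(y_r, y_s, mu_r, mu_s, nu):
    """log f(y_r, y_s) in the independence (rho = 0) case: each margin is
    Poisson mixed over a Gamma(shape=nu, mean=1) effect, i.e. negative
    binomial with size nu and success probability nu / (nu + mu)."""
    return (nbinom.logpmf(y_r, nu, nu / (nu + mu_r))
            + nbinom.logpmf(y_s, nu, nu / (nu + mu_s)))

def weighted_pairwise_cl(beta, nu, subjects):
    """cl(beta) = sum_i (1/(m_i - 1)) sum_{r<s} log f(y_ir, y_is; beta).

    subjects: list of (y_i, X_i) pairs, with y_i an (m_i,) count vector
    and X_i an (m_i, p) covariate matrix."""
    cl = 0.0
    for y, X in subjects:
        mu = np.exp(X @ beta)
        w = 1.0 / (len(y) - 1)   # weight making cl match the full log-likelihood at rho = 0
        for r, s in combinations(range(len(y)), 2):
            cl += w * pair_logpdf(y[r], y[s], mu[r], mu[s], nu)
    return cl
```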