Composite Likelihood (STA 4508: Topics in Likelihood Inference, February 2)

1 Recap
- Vector observation: $Y \sim f(y; \theta)$, $Y \in \mathcal{Y} \subset \mathbb{R}^m$, $\theta \in \mathbb{R}^d$
- sample of independent vectors $y_1, \dots, y_n$
- pairwise log-likelihood $\sum_{i=1}^{n} \sum_{r=1}^{m} \sum_{s>r} w_{rs} \log f_2(y_{ir}, y_{is}; \theta)$; the weights are often 1
- more generally, $c\ell(\theta; y) = \sum_{i=1}^{n} \sum_{k \in K} w_k\, \ell_k(\theta; y_i)$, with $\ell_k(\theta; y_i) = \log f(y_i \in A_k; \theta)$

2 Inference
- Sample: $Y_1, \dots, Y_n$ i.i.d., $c\ell(\theta; y) = \sum_{i=1}^{n} c\ell(\theta; y_i)$
- $\sqrt{n}(\hat\theta_{CL} - \theta) \,\dot\sim\, N\{0, G^{-1}(\theta)\}$, with Godambe information $G(\theta) = H(\theta) J(\theta)^{-1} H(\theta)$
- $w(\theta) = 2\{c\ell(\hat\theta_{CL}) - c\ell(\theta)\} \,\dot\sim\, \sum_{a=1}^{d} \mu_a Z_a^2$, where $Z_a \sim N(0,1)$ independently and $\mu_1, \dots, \mu_d$ are the eigenvalues of $J(\theta)H(\theta)^{-1} = H(\theta)G(\theta)^{-1}$
- Nuisance parameters: $\theta = (\psi, \lambda)$, $\sqrt{n}(\hat\psi_{CL} - \psi) \,\dot\sim\, N\{0, G^{\psi\psi}(\theta)\}$
- $w(\psi) = 2\{c\ell(\hat\theta_{CL}) - c\ell(\tilde\theta_\psi)\} \,\dot\sim\, \sum_{a=1}^{d_0} \mu_a Z_a^2$, where $\mu_1, \dots, \mu_{d_0}$ are the eigenvalues of $(H^{\psi\psi})^{-1} G^{\psi\psi}$
- (a small numerical sketch of these quantities follows)
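The sandwich (Godambe) calculations above are straightforward once estimates of $H(\theta)$ and $J(\theta)$ are available, for instance from numerical second derivatives and the empirical variance of the per-observation composite scores. The sketch below, in Python with numpy, is illustrative only: the 2x2 matrices at the end are made up, and `godambe_quantities` and `wchisq_quantile` are hypothetical helper names.

```python
import numpy as np

def godambe_quantities(H, J):
    """Given estimates of the sensitivity matrix H(theta) = -E{cl''} and the
    variability matrix J(theta) = var{cl'}, return the Godambe information
    G = H J^{-1} H and the eigenvalues mu_a of J H^{-1}, which weight the
    chi-squared mixture limit of w(theta) = 2{cl(theta_hat) - cl(theta)}."""
    H, J = np.asarray(H, float), np.asarray(J, float)
    G = H @ np.linalg.solve(J, H)                      # H J^{-1} H
    mu = np.sort(np.linalg.eigvals(J @ np.linalg.inv(H)).real)[::-1]
    return G, mu

def wchisq_quantile(mu, prob=0.95, nsim=200_000, seed=0):
    """Monte Carlo quantile of sum_a mu_a * Z_a^2 with Z_a iid N(0,1),
    used to calibrate the composite likelihood ratio statistic w(theta)."""
    rng = np.random.default_rng(seed)
    z2 = rng.standard_normal((nsim, len(mu))) ** 2
    return np.quantile(z2 @ np.asarray(mu, float), prob)

# toy 2x2 matrices, purely for illustration
H = np.array([[2.0, 0.3], [0.3, 1.0]])
J = np.array([[2.5, 0.6], [0.6, 1.8]])
G, mu = godambe_quantities(H, J)
print(G, mu, wchisq_quantile(mu))
```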

3 Example: dichotomized MV normal
- $Y_{ir} = 1\{Z_{ir} > 0\}$, $Z_i \sim N_m(0, R)$ with constant correlation $\rho$, $r = 1, \dots, m$; $i = 1, \dots, n$
- pairwise log-likelihood (a sketch of $\ell_2(\rho)$ follows this slide):
  $\ell_2(\rho) = \sum_{i=1}^{n} \sum_{s<r} \{ y_{ir} y_{is} \log P_{11} + y_{ir}(1 - y_{is}) \log P_{10} + (1 - y_{ir}) y_{is} \log P_{01} + (1 - y_{ir})(1 - y_{is}) \log P_{00} \}$, where $P_{11} = P(Y_r = 1, Y_s = 1)$, etc.
- $\mathrm{a.var}(\hat\rho_{CL}) = \dfrac{1}{n}\, \dfrac{4\pi^2(1-\rho^2)}{m^2(m-1)^2}\, \mathrm{var}(T)$, where $T = \sum_i \sum_{s<r} (2 y_{ir} y_{is} - y_{ir} - y_{is})$
- $\mathrm{var}(T) = n\{ m^4(\cdots) + m^3(\cdots) + m^2(\cdots) + m(\cdots) \}$, a polynomial in $m$ whose coefficients involve the joint success probabilities
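For this example the pair probabilities have a closed form, since $P(Z_r > 0, Z_s > 0) = 1/4 + \arcsin(\rho)/(2\pi)$ for standard bivariate normal variables with correlation $\rho$. The sketch below (Python; function name and simulated data are hypothetical) evaluates $\ell_2(\rho)$ and maximizes it over a grid, assuming an equicorrelated latent correlation matrix.

```python
import numpy as np

def pairwise_loglik_probit(rho, y):
    """Pairwise log-likelihood l_2(rho) for y_ir = 1{z_ir > 0}, z_i ~ N_m(0, R)
    with equicorrelation rho.  Uses the orthant probability
    P11 = P(Z_r > 0, Z_s > 0) = 1/4 + arcsin(rho)/(2*pi).  y: (n, m) 0/1 array."""
    p11 = 0.25 + np.arcsin(rho) / (2 * np.pi)
    p10 = p01 = 0.5 - p11        # marginals are exactly 1/2
    p00 = p11                    # by symmetry of the standard bivariate normal
    n, m = y.shape
    ll = 0.0
    for r in range(m):
        for s in range(r):       # pairs with s < r
            a, b = y[:, r], y[:, s]
            ll += np.sum(a * b * np.log(p11) + a * (1 - b) * np.log(p10)
                         + (1 - a) * b * np.log(p01)
                         + (1 - a) * (1 - b) * np.log(p00))
    return ll

# crude grid maximization on simulated data (hypothetical settings)
rng = np.random.default_rng(1)
rho_true, n, m = 0.5, 500, 5
R = rho_true * np.ones((m, m)) + (1 - rho_true) * np.eye(m)
y = (rng.multivariate_normal(np.zeros(m), R, size=n) > 0).astype(float)
grid = np.linspace(-0.9, 0.9, 181)
rho_hat = grid[np.argmax([pairwise_loglik_probit(r, y) for r in grid])]
print(rho_hat)
```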

4 Asymptotic variance: pairwise vs. full likelihood
[Table: asymptotic variances of the pairwise and full likelihood estimators of $\rho$, with asymptotic relative efficiencies (ARE) at several values of $\rho$; the numerical entries are not reproduced here.]
- Numbers changed from last week; also incorrect in Cox & Reid (2004), Table 1.

5 Example: longitudinal count data
- subjects $i = 1, \dots, n$; observed counts $y_{ir}$, $r = 1, \dots, m_i$
- model: $y_{ir} \sim \text{Poisson}(u_{ir}\, x_{ir}^T \beta)$
- $u_{i1}, \dots, u_{im_i}$ gamma-distributed random effects, but correlated: $\mathrm{corr}(u_{ir}, u_{is}) = \rho^{|r-s|}$
- the joint density has a combinatorial number of terms in $m_i$; impractical
- weighted pairwise composite log-likelihood (a code skeleton follows this slide):
  $c\ell_{\text{pair}}(\beta) = \sum_{i=1}^{n} \frac{1}{m_i - 1} \sum_{r=1}^{m_i} \sum_{s=r+1}^{m_i} \log f(y_{ir}, y_{is}; \beta)$
- weights chosen so that $L_{\text{pair}}$ equals the full likelihood when $\rho = 0$
- Henderson & Shimakura (2003)
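The skeleton below shows only the weighting structure of $c\ell_{\text{pair}}$; the bivariate density $f(y_{ir}, y_{is}; \beta)$ of the correlated gamma frailty model is left as a user-supplied function, and the independent-Poisson stand-in used in the toy call (with an assumed log link and no frailty) is not the Henderson & Shimakura pair density. All names and data are hypothetical.

```python
import numpy as np
from scipy.stats import poisson

def weighted_pairwise_cl(beta, data, log_pair_density):
    """Weighted pairwise composite log-likelihood
        cl_pair(beta) = sum_i (m_i - 1)^{-1} sum_{r<s} log f(y_ir, y_is; beta).
    Each margin enters m_i - 1 pairs, so the 1/(m_i - 1) weight makes cl_pair
    coincide with the full log-likelihood when the pairs factor (rho = 0).
    `data` is a list of (y_i, x_i) with y_i of length m_i and x_i of shape
    (m_i, p); `log_pair_density(y_r, y_s, x_r, x_s, beta)` is model-specific."""
    total = 0.0
    for y_i, x_i in data:
        m_i = len(y_i)
        w_i = 1.0 / (m_i - 1)
        for r in range(m_i - 1):
            for s in range(r + 1, m_i):
                total += w_i * log_pair_density(y_i[r], y_i[s],
                                                x_i[r], x_i[s], beta)
    return total

def indep_poisson_pair(y_r, y_s, x_r, x_s, beta):
    """Toy stand-in: independent Poisson margins with a log link, ignoring the
    frailty entirely; NOT the actual bivariate density of the frailty model."""
    return (poisson.logpmf(y_r, np.exp(x_r @ beta))
            + poisson.logpmf(y_s, np.exp(x_s @ beta)))

# toy call with made-up data: one subject, m_i = 3, two covariates
beta = np.array([0.2, -0.1])
data = [(np.array([1, 0, 2]), np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5]]))]
print(weighted_pairwise_cl(beta, data, indep_poisson_pair))
```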

6 Example: Varin & Czado (2010)
- pain severity scores recorded at four time points: morning, noon, evening, bedtime
- 119 patients; varying number of days per patient
- covariates: personal and weather
- response: pain score $y_{ij}$, recorded at time $t_{ij}$ for observation $j$ on subject $i$
- $y_{ij}^*$ a continuous latent variable; $y_{ij} = k$ if $a_{k-1} < y_{ij}^* \le a_k$
- $y_{ij}^* = x_{ij}^T \beta + u_i + \epsilon_{ij}$, with $u_i \sim N(0, \sigma^2)$ and $\epsilon_{ij} \sim N(0, 1)$
- $f(y_{i1}, \dots, y_{im_i}) = \int \prod_{j=1}^{m_i} \{ \Phi(a_{y_{ij}} - x_{ij}^T\beta - u_i) - \Phi(a_{y_{ij}-1} - x_{ij}^T\beta - u_i) \}\, \frac{1}{\sigma}\, \phi\!\left(\frac{u_i}{\sigma}\right) du_i$
- $L(\theta; y) = \prod_{i=1}^{n} f(y_{i1}, \dots, y_{im_i})$

7 ... pain severity scores
- $y_{ij}^*$ and $y_{ij'}^*$ have constant correlation $\sigma^2/(\sigma^2 + 1)$
- points nearer in time might be expected to have higher correlation
- change $\epsilon_{ij}$ i.i.d. $N(0,1)$ to $\mathrm{corr}(\epsilon_{ij}, \epsilon_{ij'}) = \exp(-\delta |t_{ij} - t_{ij'}|)$
- now $L(\theta; y) = \prod_{i=1}^{n} \int_{\tilde a_{y_{i1}-1}}^{\tilde a_{y_{i1}}} \cdots \int_{\tilde a_{y_{im_i}-1}}^{\tilde a_{y_{im_i}}} \phi_{m_i}(z_{i1}, \dots, z_{im_i}; R_i)\, dz_{i1} \cdots dz_{im_i}$, with $R_{i,jj'} = \dfrac{\sigma^2 + e^{-\delta |t_{ij} - t_{ij'}|}}{\sigma^2 + 1}$
- pairwise log-likelihood: $c\ell(\theta; y) = \sum_{i=1}^{n} \sum_{j<j'} \log f_2(y_{ij}, y_{ij'}; \theta)\, 1_{[-q,q]}(t_{ij} - t_{ij'})$
- weights are 1 or 0, depending on the distance between time points (see the snippet below)
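As a small illustration, the snippet below (Python; all names and parameter values hypothetical) builds the latent correlation matrix $R_i$ from the observation times, using the correlation function as reconstructed above, and evaluates the 0/1 pair weight $1_{[-q,q]}(t_{ij} - t_{ij'})$.

```python
import numpy as np

def latent_corr(t, sigma2, delta):
    """Correlation matrix R_i of the latent y*_ij at times t:
    R_jj' = (sigma2 + exp(-delta * |t_j - t_j'|)) / (sigma2 + 1)."""
    dt = np.abs(np.subtract.outer(t, t))
    R = (sigma2 + np.exp(-delta * dt)) / (sigma2 + 1.0)
    np.fill_diagonal(R, 1.0)
    return R

def pair_weight(t_j, t_jprime, q):
    """0/1 weight 1_{[-q, q]}(t_j - t_j'): keep only pairs within q time units."""
    return float(abs(t_j - t_jprime) <= q)

t = np.array([0.0, 0.25, 0.5, 0.75])   # e.g. four readings on one day
print(latent_corr(t, sigma2=1.5, delta=2.0))
print(pair_weight(0.0, 0.75, q=0.5))    # 0.0: pair dropped from cl(theta; y)
```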

8 ... pain severity scores
[Table 4 of Varin & Czado; see the supplementary material online.]

9 Applications: spatial and space-time data
- conditional approaches seem more natural: condition on neighbours in space, or on a small number of lags in time
- some form of blockwise components often proposed (Stein et al., 2004; Caragea and Smith, 2007)
- fMRI time series (Kang et al., 2012)
- air pollution and health effects (Bai et al., 2012)
- computer experiments: Gaussian process models (Xi, 2012)
- spatially correlated extremes: the joint tail probability is known, but the joint density requires combinatorial effort (partial derivatives); composite likelihood based on the joint distribution of pairs or triples seems to work well (Davison et al., 2012; Genton et al., 2012; Ribatet, 2012)

10 Spatial extremes
- vector observations $(X_{1i}, \dots, X_{mi})$, $i = 1, \dots, n$; for example, rainfall at each of $m$ locations
- component-wise maxima $Z_1, \dots, Z_m$; $Z_j = \max(X_{j1}, \dots, X_{jn})$
- the $Z_j$ are transformed (centered and scaled)
- general theory says $\Pr(Z_1 \le z_1, \dots, Z_m \le z_m) = \exp\{-V(z_1, \dots, z_m)\}$
- the function $V(\cdot)$ can be parameterized via Gaussian process models
- example, for two sites at locations $0$ and $h$ with $z_1 = z(0)$, $z_2 = z(h)$ (a code sketch of this $V$ follows):
  $V(z_1, z_2) = z_1^{-1}\, \Phi\{\tfrac{1}{2} a(h) + a^{-1}(h) \log(z_2/z_1)\} + z_2^{-1}\, \Phi\{\tfrac{1}{2} a(h) + a^{-1}(h) \log(z_1/z_2)\}$, where $a(h)^2 = h^T \Omega^{-1} h$
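A minimal sketch of the pairwise exponent function for this model (Python with scipy; function names hypothetical), assuming the parameterization $a(h)^2 = h^T\Omega^{-1}h$ written above; the matrix $\Omega$ and site separation in the example call are made up.

```python
import numpy as np
from scipy.stats import norm

def V_pair(z1, z2, h, Omega):
    """Pairwise exponent function of the max-stable model for two sites
    separated by the vector h:
        V(z1, z2) = z1^{-1} Phi(a/2 + log(z2/z1)/a)
                  + z2^{-1} Phi(a/2 + log(z1/z2)/a),
    with a(h)^2 = h^T Omega^{-1} h."""
    a = np.sqrt(h @ np.linalg.solve(Omega, h))
    return (norm.cdf(0.5 * a + np.log(z2 / z1) / a) / z1
            + norm.cdf(0.5 * a + np.log(z1 / z2) / a) / z2)

def bivariate_cdf(z1, z2, h, Omega):
    """Pr(Z(0) <= z1, Z(h) <= z2) = exp{-V(z1, z2)} on unit Frechet margins."""
    return np.exp(-V_pair(z1, z2, h, Omega))

# example: two sites 10 distance units apart, isotropic Omega (made up)
Omega = np.diag([30.0, 30.0])
print(bivariate_cdf(1.0, 1.0, np.array([10.0, 0.0]), Omega))
```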

11 ... spatial extremes
- $\Pr(Z_1 \le z_1, \dots, Z_m \le z_m) = \exp\{-V(z_1, \dots, z_m)\}$
- to compute the log-likelihood function we need the density: combinatorial explosion in computing the joint derivatives of $V(\cdot)$
- Davison et al. (2012, Statistical Science) used pairwise composite likelihood
- they compared the fits of several competing models, using the AIC analogue described above
- applied to annual maximum rainfall at several stations near Zurich

12 Davison et al. (2012)
[Graphic from Davison et al. (2012), not reproduced in this transcription.]

13 ... Davison et al. (2012)
[Graphic from Davison et al. (2012), not reproduced in this transcription.]

14 ... applications
- time series: a case of large $m$, fixed $n$; need new arguments for consistency and asymptotic normality
  - consecutive pairs: consistent, but not always asymptotically normal
  - AR(1): consecutive pairs fully efficient; all pairs terrible (consistent, but highly variable)
  - MA(1): consecutive pairs consistent but very inefficient
  - Davis and Yau (2011); a sketch of the consecutive-pairs construction for an AR(1) follows this slide
- genetics: estimation of the recombination rate
  - somewhat similar to time series, but correlation may not decrease with increasing length, suggesting that using all possible pairs may be inconsistent
  - joint blocks of short sequences seem preferable
  - linkage disequilibrium; family-based sampling
  - Larribe and Fearnhead (2011); Choi and Briollais (2012)
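To make the consecutive-pairs idea concrete, here is a sketch (Python; hypothetical names, simulated data) of the composite log-likelihood built from consecutive pairs of a stationary Gaussian AR(1), each pair being bivariate normal with lag-one correlation $\phi$.

```python
import numpy as np

def consecutive_pairs_cl(phi, y, sigma2=1.0):
    """Composite log-likelihood from consecutive pairs (y_t, y_{t+1}) of a
    stationary Gaussian AR(1) with innovation variance sigma2: each pair is
    bivariate normal with common variance sigma2/(1 - phi^2), correlation phi."""
    y = np.asarray(y, float)
    v = sigma2 / (1.0 - phi ** 2)          # stationary variance
    a, b = y[:-1], y[1:]
    q = (a ** 2 - 2 * phi * a * b + b ** 2) / (v * (1 - phi ** 2))
    return np.sum(-np.log(2 * np.pi * v) - 0.5 * np.log(1 - phi ** 2) - 0.5 * q)

# crude grid check on simulated data (hypothetical settings)
rng = np.random.default_rng(2)
phi_true, T = 0.6, 2000
y = np.zeros(T)
y[0] = rng.standard_normal() / np.sqrt(1 - phi_true ** 2)   # stationary start
for t in range(1, T):
    y[t] = phi_true * y[t - 1] + rng.standard_normal()
grid = np.linspace(-0.95, 0.95, 191)
print(grid[np.argmax([consecutive_pairs_cl(p, y) for p in grid])])
```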

15 Example: Ising model
- Ising model: $f(y; \theta) = \dfrac{1}{Z(\theta)} \exp\Big( \sum_{(j,k)\in E} \theta_{jk}\, y_j y_k \Big)$, $j, k = 1, \dots, K$
- neighbourhood contributions:
  $f(y_j \mid y_{(-j)}; \theta) = \dfrac{\exp(2 y_j \sum_{k \ne j} \theta_{jk} y_k)}{\exp(2 y_j \sum_{k \ne j} \theta_{jk} y_k) + 1} = \exp\{\ell_j(\theta; y)\}$
- penalized composite likelihood estimation based on a sample $y^{(1)}, \dots, y^{(n)}$ (a sketch of the objective follows):
  $\max_\theta\; \sum_{i=1}^{n} \sum_{j=1}^{K} \ell_j(\theta; y^{(i)}) - \sum_{j<k} P_\lambda(|\theta_{jk}|)$
- Xue et al. (2012); Ravikumar et al. (2010)
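A sketch of the penalized objective for spins $y \in \{-1, +1\}^K$ (Python; names and the toy data are hypothetical). A lasso penalty is used here as one concrete choice of $P_\lambda$; maximization over $\theta$ would be handed to a standard optimizer and is not shown.

```python
import numpy as np

def penalized_cl(Theta, Y, lam):
    """Penalized composite (pseudo) log-likelihood for the Ising model.
    Theta: symmetric (K, K) interaction matrix with zero diagonal.
    Y: (n, K) array of spins in {-1, +1}.
    Conditional contributions: l_j = log f(y_j | y_{-j}; Theta)
      = -log(1 + exp(-2 y_j sum_{k != j} Theta_{jk} y_k)).
    Penalty: lasso, P_lambda(|t|) = lam * |t|, applied to each edge weight."""
    M = Y @ Theta                      # M[i, j] = sum_k Theta_{jk} y_{ik}
    cl = -np.sum(np.logaddexp(0.0, -2.0 * Y * M))
    penalty = lam * np.sum(np.abs(Theta[np.triu_indices_from(Theta, k=1)]))
    return cl - penalty

# toy usage with a hypothetical sparse Theta and random spins
rng = np.random.default_rng(3)
K, n = 5, 200
Theta = np.zeros((K, K)); Theta[0, 1] = Theta[1, 0] = 0.4
Y = rng.choice([-1.0, 1.0], size=(n, K))
print(penalized_cl(Theta, Y, lam=0.1))
```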

16 Misspecified models: general theory
- $y_1, \dots, y_n$ independent observations with distribution function $G(\cdot)$ and density $g(\cdot)$
- we fit the incorrect model $f(y; \theta)$
- the Kullback-Leibler (KL) divergence between $f(y; \theta)$ and $g(y)$ is defined as
  $KL(\theta) = \int \log\left\{ \dfrac{g(y)}{f(y; \theta)} \right\} g(y)\, dy$
- Wikipedia writes $D_{KL}(G \,\|\, F)$, or more precisely $D_{KL}(P_G \,\|\, P_F)$
- $KL(\theta) \ge 0$, and $KL(\theta) = 0 \iff f(y; \theta) \equiv g(y)$
- define $\theta^* = \arg\min_\theta KL(\theta)$
- $f(y; \theta^*)$ is closest to $G(\cdot)$ in the family $\{ f(\cdot; \theta), \theta \in \Theta \}$

17 ... misspecified models
- $KL(\theta) = \int \log\left\{ \dfrac{g(y)}{f(y; \theta)} \right\} g(y)\, dy$, $\quad \theta^* = \arg\min_\theta KL(\theta)$
- $\theta^* = \arg\max_\theta \int \log\{f(y; \theta)\}\, g(y)\, dy = \arg\max_\theta E_G\{\ell(\theta; y)\}$
- this leads to a proof that the maximum likelihood estimator converges to $\theta^*$
- if $g(y) = f(y; \theta_0)$, then $\theta^* = \theta_0$; otherwise $\theta^*$ is the "least false" parameter value

18 ... misspecified models
- Example: the true model $G$ is log-normal, $\log y \sim N(\mu, \sigma^2)$; $g(y) = ?$
- fitted model: $f(y; \theta) = \dfrac{1}{\theta} \exp(-y/\theta)$
- $E_G\{\ell(\theta; y)\} = -\log\theta - E_G(y/\theta)$, giving $\theta^* = E_G(y) = \exp(\mu + \sigma^2/2)$
- reminder: the MLE solves $\partial \ell(\theta; y)/\partial\theta = 0$, with $y = (y_1, \dots, y_n)$
- in the example, $\hat\theta = \bar y \xrightarrow{p} \theta^*$ (a small simulation check follows)
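A minimal simulation check of this example (Python; the parameter values are made up): the exponential-model MLE $\bar y$ should settle near $\theta^* = \exp(\mu + \sigma^2/2)$.

```python
import numpy as np

# True G: log-normal, log y ~ N(mu, s2).  Fitted model: exponential, mean theta.
# The MLE theta_hat = ybar should converge to theta* = E_G(y) = exp(mu + s2/2).
rng = np.random.default_rng(4)
mu, s2, n = 0.0, 0.5, 100_000
y = np.exp(rng.normal(mu, np.sqrt(s2), size=n))
theta_hat = y.mean()                    # exponential-model MLE
theta_star = np.exp(mu + s2 / 2)        # least-false parameter value
print(theta_hat, theta_star)            # close for large n
```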

19 ... misspecified models
- $\partial_\theta \ell(\hat\theta; y) = 0$; how do we do inference?
- $E_G\{\partial_\theta \ell(\theta; y)\} = \int \partial_\theta \ell(\theta; y)\, g(y)\, dy = 0$ at $\theta = \theta^*$
- $-E_G\{\partial_{\theta\theta} \ell(\theta; y)\} = -\int \partial_{\theta\theta} \ell(\theta; y)\, g(y)\, dy \equiv H(\theta)$
- $E_G[\{\partial_\theta \ell(\theta; y)\}^2] = \int \{\partial_\theta \ell(\theta; y)\}^2\, g(y)\, dy \equiv J(\theta)$
- $\hat\theta - \theta^* = H(\theta^*)^{-1} U(\theta^*)\{1 + o_p(1)\}$, where $U(\theta) = \partial_\theta \ell(\theta)$ (this expansion is written out below)
- $\hat\theta \,\dot\sim\, N\{\theta^*, G^{-1}(\theta^*)\}$, with $G(\theta) = H(\theta) J^{-1}(\theta) H(\theta)$
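The step from the estimating equation to the sandwich variance is the usual expansion of the score about $\theta^*$; written out, consistent with the notation above,
\begin{align*}
0 = U(\hat\theta) &\approx U(\theta^*) + (\hat\theta - \theta^*)\, \partial_\theta U(\theta^*)
\;\Longrightarrow\;
\hat\theta - \theta^* \approx \{-\partial_\theta U(\theta^*)\}^{-1} U(\theta^*) = H(\theta^*)^{-1} U(\theta^*)\{1 + o_p(1)\},\\
\operatorname{var}_G(\hat\theta) &\approx H(\theta^*)^{-1}\, \operatorname{var}_G\{U(\theta^*)\}\, H(\theta^*)^{-1} = H(\theta^*)^{-1} J(\theta^*)\, H(\theta^*)^{-1} = G^{-1}(\theta^*),
\end{align*}
since $-\partial_\theta U(\theta^*) = -\partial_{\theta\theta}\ell(\theta^*; y)$ estimates $H(\theta^*)$ and $\operatorname{var}_G\{U(\theta^*)\} = J(\theta^*)$.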

20 Examples
- $y_1, \dots, y_n$ i.i.d. from $G$; we assume $f(y; \theta)$ is $N(\mu, \sigma^2)$
- $\hat\mu = \bar y$, $\hat\sigma^2 = \frac{1}{n}\sum (y_i - \bar y)^2$
- $\partial_\theta \ell(\theta; y) = \left[ \sum(y_i - \mu)/\sigma^2,\; -n/(2\sigma^2) + \sum(y_i - \mu)^2/(2\sigma^4) \right]$
- $H(\theta) = n \begin{pmatrix} 1/\sigma_0^2 & 0 \\ 0 & 1/(2\sigma_0^4) \end{pmatrix}$, $\quad J(\theta) = n \begin{pmatrix} 1/\sigma_0^2 & \mu_3/(2\sigma_0^6) \\ \mu_3/(2\sigma_0^6) & (\mu_4 - \sigma_0^4)/(4\sigma_0^8) \end{pmatrix}$
- where $\sigma_0^2 = \mathrm{var}_G(y_1)$, $\mu_3 = E_G(y_1 - \mu_0)^3$, etc.
- $H(\theta) = -E_G\{\ell''(\theta; y)\}$; $\quad J(\theta) = \mathrm{var}_G\{\ell'(\theta; y)\}$ (a numerical check follows)
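A quick Monte Carlo check of the displayed $H$ and $J$ (Python; the exponential choice of $G$ is just a convenient skewed distribution, and all names are illustrative). The matrices are computed per observation, i.e. the displayed matrices divided by $n$.

```python
import numpy as np

# Numerical check of H and J for the normal working model when the data come
# from a skewed G (here exponential).  Per-observation versions of the
# displayed matrices, built from mu0, sigma0^2, mu3, mu4.
rng = np.random.default_rng(5)
y = rng.exponential(1.0, size=1_000_000)
mu0, s2 = y.mean(), y.var()
mu3 = np.mean((y - mu0) ** 3)
mu4 = np.mean((y - mu0) ** 4)

# score components of a single observation, evaluated at (mu0, s2)
u1 = (y - mu0) / s2
u2 = -1 / (2 * s2) + (y - mu0) ** 2 / (2 * s2 ** 2)
J_hat = np.cov(np.vstack([u1, u2]))          # empirical variability matrix
J_formula = np.array([[1 / s2, mu3 / (2 * s2 ** 3)],
                      [mu3 / (2 * s2 ** 3), (mu4 - s2 ** 2) / (4 * s2 ** 4)]])
H_formula = np.array([[1 / s2, 0.0], [0.0, 1 / (2 * s2 ** 2)]])
print(np.round(J_hat, 3), np.round(J_formula, 3), np.round(H_formula, 3), sep="\n")
```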

21 ... examples
- $y_1, \dots, y_n$ i.i.d. from $G$; we assume $f(y; \theta)$ is $N(\mu, \sigma^2)$
- $\mathrm{a.var}(\hat\theta) = \begin{pmatrix} \sigma_0^2/n & \gamma_3 \sigma_0^3/n \\ \gamma_3 \sigma_0^3/n & \gamma_4 \sigma_0^4/n \end{pmatrix}$, with $\gamma_3 = E(y_1 - \mu_0)^3/\sigma_0^3$ and $\gamma_4 = E(y_1 - \mu_0)^4/\sigma_0^4 - 1$
- note that $\hat\mu$ and $\hat\sigma^2$ are not uncorrelated unless $\gamma_3 = 0$
- $\hat\theta = (\hat\mu, \hat\sigma^2) = \{\bar y,\; \sum(y_i - \bar y)^2/n\}$

22 ... examples
- linear regression with correlated errors; model $G$: $y = X\beta_0 + \epsilon$, $\epsilon \sim (0, \sigma^2 R)$
- working model: $f(y; \beta) = N(X\beta, \sigma^2 I)$, uncorrelated errors
- $\hat\beta = (X^TX)^{-1}X^Ty$: the least squares estimator, and the MLE under normality
- $E_G(\hat\beta) = \beta_0$: the LSE is unbiased (consistent)
- $\mathrm{var}_G(\hat\beta) = \sigma^2 (X^TX)^{-1} X^T R X (X^TX)^{-1}$: the sandwich variance
- the working-model variance $\sigma^2 (X^TX)^{-1}$ is incorrect

23 ... examples
- linear regression with correlated errors; model $G$: $y = X\beta_0 + \epsilon$, $\epsilon \sim (0, \sigma^2 R)$
- $\mathrm{var}_G(\hat\beta) = \sigma^2 (X^TX)^{-1} X^T R X (X^TX)^{-1}$: the sandwich variance; the working-model variance $\sigma^2(X^TX)^{-1}$ is incorrect
- single-intercept model with constant correlation $\rho$: $\hat\beta = \bar y$
- the true standard error of $\bar y$ is $\sigma\{1 + \rho(n-1)\}^{1/2}/\sqrt{n}$; compare to $\sigma/\sqrt{n}$
- with $\rho = 0.1$ and $n = 100$ the ratio is 3.3: variance inflation (see the numerical sketch below)
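The two variance formulas above are easy to compare numerically; the sketch below (Python; helper name hypothetical) reproduces the 3.3 ratio for the intercept-only, equicorrelated case.

```python
import numpy as np

def ols_variances(X, R, sigma2=1.0):
    """Naive (working-model) and sandwich variances of the least-squares
    estimator when var(eps) = sigma2 * R rather than sigma2 * I."""
    XtX_inv = np.linalg.inv(X.T @ X)
    naive = sigma2 * XtX_inv
    sandwich = sigma2 * XtX_inv @ X.T @ R @ X @ XtX_inv
    return naive, sandwich

# intercept-only design with equicorrelated errors
n, rho = 100, 0.1
X = np.ones((n, 1))
R = rho * np.ones((n, n)) + (1 - rho) * np.eye(n)
naive, sandwich = ols_variances(X, R)
# ratio of true to naive SE of ybar: sqrt(1 + rho*(n-1)) ~ 3.3
print(np.sqrt(sandwich[0, 0] / naive[0, 0]))
```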
