The Hybrid Likelihood: Combining Parametric and Empirical Likelihoods
1/25 The Hybrid Likelihood: Combining Parametric and Empirical Likelihoods. Nils Lid Hjort (with Ingrid Van Keilegom and Ian McKeague), Department of Mathematics, University of Oslo. Building Bridges (at Bislett), May 2017, Teknologihuset.
2/25 Bridging parametrics and nonparametrics

Building Bridges, connecting parametrics and nonparametrics: here I discuss one such bridge operation (a nonparametric extension of a parametric likelihood, or a parametric focus for a nonparametric method). I use FIC tools (Focused Information Criteria) for fine-tuning and for selecting ingredients for the bridge.
3/25 Theme: Combining parametrics with nonparametrics

Suppose y_1, ..., y_n are i.i.d. from f, with inference needed for the focus parameter ψ = ψ(f).

Parametric likelihood approach (perfect if the model is perfect): Fit f to {f_θ : θ ∈ Θ} via maximum likelihood, with θ̂_ML maximising the log-likelihood ℓ_n(θ) = log L_n(θ). Then √n(θ̂_ML − θ) →_d N_p(0, J(θ)^{−1}), and the delta method gives √n(ψ̂_ML − ψ) →_d N(0, κ²), with κ² = c^t J(θ)^{−1} c and c = ∂ψ(θ)/∂θ. Also: the Wilks theorem, χ²_1.

Nonparametric likelihood approach (no model conditions needed): Identify ψ via E_f m(Y, ψ) = 0. The empirical likelihood R_n(ψ) is the maximum of ∏_{i=1}^n (n w_i) under ∑_{i=1}^n w_i = 1, ∑_{i=1}^n w_i m(y_i, ψ) = 0, each w_i > 0. Then −2 log R_n(ψ) →_d χ²_1.
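The empirical likelihood side of this slide can be made concrete. Below is a minimal sketch, assuming the simplest control function m(y, ψ) = y − ψ (the mean); the function name `log_el_mean` and the bisection scheme for the scalar Lagrange multiplier are my own choices, not from the talk.

```python
import numpy as np

def log_el_mean(y, psi, tol=1e-12):
    """log R_n(psi): empirical log-likelihood ratio for the mean,
    computed via the scalar Lagrange-multiplier dual."""
    d = np.asarray(y, dtype=float) - psi
    if d.max() <= 0 or d.min() >= 0:
        return -np.inf  # psi outside the convex hull of the data
    # admissible multipliers keep every weight positive: 1 + lam*d_i > 0
    lo = -1.0 / d.max() + 1e-10
    hi = -1.0 / d.min() - 1e-10
    g = lambda lam: np.sum(d / (1.0 + lam * d))  # strictly decreasing in lam
    while hi - lo > tol:                          # bisect for the root of g
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return -np.sum(np.log1p(0.5 * (lo + hi) * d))

rng = np.random.default_rng(1)
y = rng.normal(size=200)
print(log_el_mean(y, y.mean()))      # maximal at the sample mean: ~ 0
print(-2 * log_el_mean(y, 0.0))      # approximately chi-squared_1 in size
```

With a vector-valued m the same dual idea applies, with a q-dimensional multiplier in place of the bisection.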
4/25 How to combine parametric and empirical likelihood?

Main idea (with details, variations and applications to come): Decide on control parameters μ = (μ_1, ..., μ_q), identified via E m_j(Y, μ) = 0 for j = 1, ..., q; put the parametric model through the EL, giving R_n(μ(θ)); and form H_n(θ) = L_n(θ)^{1−a} R_n(μ(θ))^a.

I will show that the hybrid likelihood estimator θ̂_HL maximising h_n(θ) = (1 − a) ℓ_n(θ) + a log R_n(μ(θ)), along with the focus parameter estimator ψ̂_HL = ψ(f(·, θ̂_HL)), has good properties. I'll invent FIC-type schemes to assist in selecting the balance parameter a in [0, 1] and the control parameters μ_1, ..., μ_q.
5/25 Plan

General setup (so far for i.i.d. data, extensions later): with working model f(y, θ), leading to log-likelihood ℓ_n(θ), and control parameters μ:

h_n(θ) = (1 − a) ℓ_n(θ) + a log R_n(μ(θ)).

A. Examples
B. Basics for the EL
C. Theory: under the model
D. Theory: outside the model
E. Fine-tuning the balance parameter a
F. Choosing the control parameters μ_1, ..., μ_q
G. Concluding remarks (and questions)
6/25 A: Examples

Example 1. Let f_θ be the normal (ξ, σ²), and use m_j(y, μ_j) = I{y ≤ μ_j} − j/4 for j = 1, 2, 3. Then HL means estimating (ξ, σ) while factoring in that the three quartiles ought to be estimated well too.

Example 2. Let f_θ be the Beta with parameters (b, c). ML means moment matching for log y_i and log(1 − y_i). Add to these the functions m_1(y, μ_1) = y − μ_1 and m_2(y, μ_2) = y² − μ_2. Then HL is Beta fitting while keeping the mean and variance not far from

E_Beta Y = b/(b + c) and Var_Beta Y = [b/(b + c)] [c/(b + c)] / (b + c + 1).
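For Example 1, the control equations E m_j(Y, μ_j) = 0 are solved by the model-implied quartiles μ_j(θ) = ξ + z_{j/4} σ. A small check of this, assuming Y ~ N(ξ, σ²); the helper names are illustrative only:

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def control_mean(mu, j, xi, sigma):
    """E m_j(Y, mu) = P(Y <= mu) - j/4 under Y ~ N(xi, sigma^2)."""
    return Phi((mu - xi) / sigma) - j / 4.0

xi, sigma = 2.0, 1.5
z = {1: -0.6744898, 2: 0.0, 3: 0.6744898}   # standard normal quartiles
for j in (1, 2, 3):
    mu_j = xi + z[j] * sigma                # model-implied quartile mu_j(theta)
    print(j, control_mean(mu_j, j, xi, sigma))  # each ~ 0
```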
7/25 Example 3. Consider f(y, θ) = θ y^{θ−1} on (0, 1). The log-likelihood is n{log θ − (θ − 1) Z_n}, with Z_n = (1/n) ∑_{i=1}^n log(1/y_i), and θ̂_ML = 1/Z_n. Then put the EL for the mean μ through the model, yielding R_n(μ(θ)) with μ(θ) = θ/(θ + 1). This is HL with a = 1/2.

[Figure: log-likelihood, log-EL and log-HL curves plotted against θ.]
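Example 3 can be run end to end. The sketch below simulates from f(y, θ) = θ y^{θ−1}, recovers the closed-form θ̂_ML = 1/Z_n, and maximises the hybrid criterion h_n(θ) with a = 1/2 by grid search, the EL for the mean being computed via its Lagrange dual; the simulated data, the grid, and the function names are my own illustrative choices.

```python
import numpy as np

def log_el_mean(y, m, tol=1e-12):
    """log R_n(m): empirical log-likelihood ratio for the mean (Lagrange dual)."""
    d = np.asarray(y, dtype=float) - m
    if d.max() <= 0 or d.min() >= 0:
        return -np.inf
    lo, hi = -1.0 / d.max() + 1e-10, -1.0 / d.min() - 1e-10
    g = lambda lam: np.sum(d / (1.0 + lam * d))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return -np.sum(np.log1p(0.5 * (lo + hi) * d))

rng = np.random.default_rng(7)
theta_true = 2.0
y = rng.uniform(size=500) ** (1.0 / theta_true)  # F(y) = y^theta, so Y = U^(1/theta)

Zn = np.mean(np.log(1.0 / y))
theta_ml = 1.0 / Zn                              # closed-form MLE

def h_n(theta, a=0.5):
    """Hybrid criterion: (1-a) log-lik + a log EL at mu(theta) = theta/(theta+1)."""
    loglik = len(y) * (np.log(theta) - (theta - 1.0) * Zn)
    return (1.0 - a) * loglik + a * log_el_mean(y, theta / (theta + 1.0))

grid = np.linspace(1.2, 3.0, 181)
theta_hl = grid[np.argmax([h_n(t) for t in grid])]
print(round(theta_ml, 3), round(theta_hl, 3))    # both should land near 2
```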
8/25 Example 4. I use Newcomb's 1889 speed of light data, with n = 66 and two grand outliers at −44 and −2. The true value is 33.02.

ML (a = 0): bad estimates (26.21, 10.75). Removing the outliers: (27.75, 5.08). I now use HL with histogram-associated control parameters, with k = 6 cells (−∞, 10.5], (10.5, 20.5], (20.5, 25.5], (25.5, 30.5], (30.5, 35.5], (35.5, ∞).

The HL, with a = 0.50: (28.23, 6.37). With a = 1: close to minimum chi-squared.
9/25 [Figure: kernel density estimate of the Newcomb data with two fitted normals.]
10/25 Two (related) viewpoints

Which μ_1, ..., μ_q should I use in (1 − a) ℓ_n(θ) + a log R_n(μ(θ))? To robustify a parametric model, and/or to help focus the nonparametric method?

Viewpoint One (focused robustness): Use control parameters to help the parametric fit do well for these too. For the normal (ξ, σ²), I might want not only the mean and standard deviation to be ok, but also ξ − 0.6745σ and ξ + 0.6745σ to reasonably match the quartiles F_n^{−1}(1/4), F_n^{−1}(3/4).

Viewpoint Two (with focus parameter): I wish the fitted model to give a particularly good estimate of ψ = ψ(f) via ψ̂_HL = ψ(f(·, θ̂_HL)). Then I use the HL with p + 1 parameters: the working model plus my focus ψ. For the normal, I may put in m(y, μ) = I{y ≤ μ} − 3/4, and use ξ̂_HL + 0.6745 σ̂_HL to estimate F^{−1}(3/4).
11/25 B: Empirical Likelihood (1-page course)

For q-vectors m_1, ..., m_n, consider

Λ_n = max{∏_{i=1}^n (n w_i): ∑_{i=1}^n w_i = 1, ∑_{i=1}^n w_i m_i = 0, each w_i > 0}.

Let

G_n(λ) = ∑_{i=1}^n 2 log(1 + λ^t m_i/√n) and G*_n(λ) = 2 λ^t V_n − λ^t W_n λ,

where V_n = n^{−1/2} ∑_{i=1}^n m_i and W_n = n^{−1} ∑_{i=1}^n m_i m_i^t.

First: −2 log Λ_n = max_λ G_n(λ) = G_n(λ̂). Second: with the m_i random; the eigenvalues of W_n away from zero and infinity; n^{−1/2} max_{i≤n} |m_i| →_pr 0; V_n bounded in probability: then G_n ≈ G*_n where it matters, and −2 log Λ_n = V_n^t W_n^{−1} V_n + o_pr(1).

This machinery is then used with m_i = m(y_i, μ(θ)).
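Both statements on this slide can be checked numerically. A sketch assuming a scalar constraint (so the multiplier can be found by bisection; the solver and the seed are my own devices), comparing the exact −2 log Λ_n with the quadratic approximation V_n^t W_n^{−1} V_n:

```python
import numpy as np

def neg2_log_EL(m, tol=1e-12):
    """Exact -2 log Lambda_n for scalar constraint values m_i, via the dual."""
    m = np.asarray(m, dtype=float)
    if m.max() <= 0 or m.min() >= 0:
        return np.inf  # 0 outside the convex hull: constraint set empty
    lo, hi = -1.0 / m.max() + 1e-10, -1.0 / m.min() - 1e-10
    g = lambda lam: np.sum(m / (1.0 + lam * m))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 2.0 * np.sum(np.log1p(0.5 * (lo + hi) * m))

rng = np.random.default_rng(3)
n = 400
m = rng.normal(0.02, 1.0, size=n)   # m_i = m(y_i, mu), slightly off-centre
Vn = m.sum() / np.sqrt(n)           # V_n = n^{-1/2} sum m_i
Wn = np.mean(m ** 2)                # W_n = n^{-1} sum m_i^2
exact = neg2_log_EL(m)
quad = Vn ** 2 / Wn                 # the quadratic approximation
print(exact, quad)                  # close: they differ by o_pr(1)
```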
12/25 C: Theory: under the model

First aim: working out how the HL behaves under model conditions (it will lose some to the ML there, but how much?). With h_n(θ) = (1 − a) ℓ_n(θ) + a log R_n(μ(θ)) and θ_0 the true value, define

A_n(s) = h_n(θ_0 + s/√n) − h_n(θ_0) = (1 − a){ℓ_n(θ_0 + s/√n) − ℓ_n(θ_0)} + a{log R_n(μ(θ_0 + s/√n)) − log R_n(μ(θ_0))}.

Understanding the behaviour of A_n = understanding the behaviour of θ̂_HL (et al.). Under mild conditions, with u(y, θ) the score function,

(U_{n,0}, V_{n,0}) = (n^{−1/2} ∑_{i=1}^n u(y_i, θ_0), n^{−1/2} ∑_{i=1}^n m(y_i, μ(θ_0))) →_d (U_0, V_0) ~ N_{p+q}(0, [[J, C], [C^t, W]]).

Note that J = J_fish is the information matrix of the working model.
13/25 Lemma: there is a well-defined and well-behaved limiting quadratic process:

A_n(s) = h_n(θ_0 + s/√n) − h_n(θ_0) →_d A(s) = s^t U* − (1/2) s^t J* s,

where

U* = (1 − a) U_0 − a ξ_0^t W^{−1} V_0, J* = (1 − a) J + a ξ_0^t W^{−1} ξ_0.

Here ξ_0 = ∂E m(Y, μ(θ_0))/∂θ. Also, U* ~ N_p(0, K*) with

K* = (1 − a)² J + a² ξ_0^t W^{−1} ξ_0 − a(1 − a)(C W^{−1} ξ_0 + ξ_0^t W^{−1} C^t).

The most important aspects of how θ̂_HL behaves can be read off from A_n(s) →_d A(s).
14/25 Fact 1 [using argmax(A_n) →_d argmax(A)]:

√n(θ̂_HL − θ_0) →_d Λ = (J*)^{−1} U* ~ N_p(0, (J*)^{−1} K* (J*)^{−1}).

Fact 2 [using max A_n →_d max A]:

Z_n(θ_0) = 2{h_n(θ̂_HL) − h_n(θ_0)} →_d Z = (U*)^t (J*)^{−1} U*.

Fact 3 [applying the delta method]: with ψ = φ(θ), ψ̂_HL = φ(θ̂_HL), and ψ_0 = φ(θ_0) the true value,

√n(ψ̂_HL − ψ_0) →_d c^t Λ ~ N(0, κ²), with κ² = c^t (J*)^{−1} K* (J*)^{−1} c and c = ∂φ(θ_0)/∂θ.

One may then compute (J*)^{−1} K* (J*)^{−1} and compare it to J_fish^{−1}. Result: the HL loses rather little compared to the ML, under model conditions, if say a ≤ 0.15:

(J*)^{−1} K* (J*)^{−1} = J_fish^{−1} + O(a²).
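The O(a²) conclusion can be illustrated in a toy case of my own (not from the slides): the N(θ, 1) model with the single control m(y, μ) = I{y ≤ μ} − 3/4 and μ(θ) = θ + z_{0.75}, for which J, W, ξ_0 and C are all available in closed form.

```python
import math

z75 = 0.6744898                                        # upper quartile of N(0,1)
phi = math.exp(-z75 ** 2 / 2) / math.sqrt(2 * math.pi) # normal density at z75

J = 1.0            # Fisher information for theta in N(theta, 1)
W = 3.0 / 16.0     # Var m(Y, mu) = (3/4)(1/4)
xi0 = phi          # xi_0 = d/dtheta E m(Y, mu(theta)) = phi(z75)
C = -phi           # Cov(score, m) = E[(Y - theta) I{Y <= mu}] = -phi(z75)

def hl_var(a):
    """Sandwich variance (J*)^{-1} K* (J*)^{-1} of sqrt(n)(theta_HL - theta_0)."""
    Jstar = (1 - a) * J + a * xi0 ** 2 / W
    Kstar = (1 - a) ** 2 * J + a ** 2 * xi0 ** 2 / W - a * (1 - a) * 2 * C * xi0 / W
    return Kstar / Jstar ** 2

for a in (0.0, 0.05, 0.10, 0.15):
    print(a, hl_var(a))   # equals J^{-1} = 1 at a = 0; excess grows like a^2
```

Halving a from 0.10 to 0.05 cuts the excess variance by a factor of about four, as the O(a²) statement predicts.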
15/25 D: Theory: outside the model

Results so far: the behaviour of θ̂_HL and the consequent ψ̂_HL is well understood under parametric model conditions, where they lose a little, but not much, compared to the ML. I will now show (though bigger machinery and more effort are required) that the HL is (often) better than the ML just outside the parametric model.

Framework: extend the f(y, θ) model (with dim(θ) = p) to a bigger f(y, θ, γ) model (with dim(γ) = r), such that γ = γ_0 corresponds to the start model: f(y, θ, γ_0) = f(y, θ). Local neighbourhood model framework: f_true(y) = f(y, θ_0, γ_0 + δ/√n). Thus ψ_true = ψ(θ_0, γ_0 + δ/√n), etc.
16/25 Under f(y, θ_0, γ_0 + δ/√n), suppose an estimation strategy θ̂ has the property

√n(θ̂ − θ_0) →_d N_p(Bδ, Ω),

for appropriate B (a p × r matrix, related to how the model bias affects the estimator) and Ω. For ψ = ψ(f) = ψ(θ, γ), one may use ψ̂ = ψ(θ̂, γ_0). Then analysis leads to

√n(ψ̂ − ψ_true) →_d N(b^t δ, τ²), with b = B^t ∂ψ/∂θ − ∂ψ/∂γ and τ² = (∂ψ/∂θ)^t Ω (∂ψ/∂θ),

with derivatives taken at the narrow model (θ_0, γ_0). Hence the limit mean squared error is mse(δ) = (b^t δ)² + τ².

Next: examine the estimation strategies ML and HL, to find B and Ω, and hence the mse(δ). For the ML: as in Hjort and Claeskens (2003); for the HL: new.
17/25 The story for the ML: essentially from Hjort and Claeskens (2003), Claeskens and Hjort (2008). We need the (p + r) × (p + r) Fisher information matrix

J_wide = [[J_00, J_01], [J_10, J_11]]

at the narrow model. From this (via various efforts):

√n(θ̂_ML − θ_0) →_d N_p(J_00^{−1} J_01 δ, J_00^{−1}).

This implies √n(ψ̂_ML − ψ_true) →_d N(ω^t δ, τ_0²) with

ω = J_10 J_00^{−1} ∂ψ/∂θ − ∂ψ/∂γ and τ_0² = (∂ψ/∂θ)^t J_00^{−1} ∂ψ/∂θ.

Hence we know mse_ML(δ) = (ω^t δ)² + τ_0², and should compare this with what we may find for the HL.
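The blocks of J_wide obey the usual partitioned-inverse identities; in particular, the lower-right block J^{11} of J_wide^{−1} (the limiting covariance Q of D_n below) equals the inverse Schur complement (J_11 − J_10 J_00^{−1} J_01)^{−1}. A quick numerical check, with a randomly generated positive-definite matrix standing in for J_wide:

```python
import numpy as np

p, r = 2, 1
rng = np.random.default_rng(5)
A = rng.normal(size=(p + r, p + r))
Jwide = A @ A.T + (p + r) * np.eye(p + r)   # a positive-definite stand-in
J00, J01 = Jwide[:p, :p], Jwide[:p, p:]
J10, J11 = Jwide[p:, :p], Jwide[p:, p:]

Q_direct = np.linalg.inv(Jwide)[p:, p:]     # J^{11}: block of the full inverse
Q_schur = np.linalg.inv(J11 - J10 @ np.linalg.inv(J00) @ J01)  # Schur complement
print(np.allclose(Q_direct, Q_schur))       # True
```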
18/25 The story for the HL: for S(y) = ∂ log f(y, θ_0, γ_0)/∂γ, let

K_01 = E m(Y, μ(θ_0)) S(Y)^t, of dimension q × r, along with L_01 = (1 − a) J_01 − a ξ_0^t W^{−1} K_01.

Then (via various efforts): √n(θ̂_HL − θ_0) →_d N_p(Bδ, Ω) with B = (J*)^{−1} L_01 and Ω = (J*)^{−1} K* (J*)^{−1}. This yields √n(ψ̂_HL − ψ_true) →_d N(ω_HL^t δ, τ_{0,HL}²) with

ω_HL = ω_HL,a = L_10 (J*)^{−1} ∂ψ/∂θ − ∂ψ/∂γ, τ_{0,HL}² = τ_{0,HL,a}² = (∂ψ/∂θ)^t (J*)^{−1} K* (J*)^{−1} ∂ψ/∂θ.

Here J*, K*, L_10 depend on the balance parameter a.
19/25 One may then compare

mse_ML(δ) = (ω^t δ)² + τ_0², mse_HL,a(δ) = (ω_HL,a^t δ)² + τ_{0,HL,a}²,

in different special setups.

[Figure: root mse for HL and ML, as functions of a.]
20/25 E: Fine-tuning the balance parameter

The precision of ψ̂_HL for estimating ψ_true depends on the underlying truth and on the balance parameter a. In the f(y, θ_0, γ_0 + δ/√n) framework, the best balance a is the minimiser of

risk(a) = mse_HL,a(δ) = (ω_HL,a^t δ)² + τ_{0,HL,a}².

Here

ω_HL,a = L_{10,a} (J*_a)^{−1} ∂ψ/∂θ − ∂ψ/∂γ, τ_{0,HL,a}² = (∂ψ/∂θ)^t (J*_a)^{−1} K*_a (J*_a)^{−1} ∂ψ/∂θ

may be estimated consistently from the data, with δ less visible:

D_n = √n(γ̂_ML − γ_0) →_d N_r(δ, Q), with Q = J^{11} from J_wide^{−1}.
21/25 Since D_n = √n(γ̂_ML − γ_0) →_d N_r(δ, Q), D_n D_n^t overestimates δ δ^t, and E(c^t D_n)² ≈ (c^t δ)² + c^t Q c. Hence we estimate the squared bias sqb = (ω_HL,a^t δ)² in the FIC way, using

ŝqb = max{(ω̂_HL,a^t D_n)² − ω̂_HL,a^t Q̂ ω̂_HL,a, 0} = n{ω̂_HL,a^t(γ̂_ML − γ_0)}² − ω̂_HL,a^t Q̂ ω̂_HL,a if nonnegative, and 0 otherwise.

This leads to

risk-hat(a) = (∂ψ/∂θ)^t (Ĵ*_a)^{−1} K̂*_a (Ĵ*_a)^{−1} ∂ψ/∂θ + ŝqb.

Via this FIC scheme we select the balance parameter a as the minimiser of risk-hat(a).
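The truncated squared-bias estimate on this slide amounts to a two-line function. A sketch with invented inputs (the ω, Q and D_n values are illustrative only, not from the talk):

```python
import numpy as np

def sqb_hat(omega, Dn, Q):
    """FIC-style squared-bias estimate: subtract the variance inflation
    of (omega^t D_n)^2 and truncate at zero."""
    raw = float(omega @ Dn) ** 2 - float(omega @ Q @ omega)
    return max(raw, 0.0)

omega = np.array([1.0, -0.5])
Q = np.array([[2.0, 0.3], [0.3, 1.0]])
print(sqb_hat(omega, np.array([3.0, 1.0]), Q))  # 2.5^2 - 1.95 = 4.30
print(sqb_hat(omega, np.array([0.2, 0.1]), Q))  # raw value negative -> 0.0
```

The truncation matters: without it, small |δ| would routinely produce negative squared-bias estimates.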
22/25 Example: n = 100 data points on (0, 1), fitted to f(y, θ) = θ y^{θ−1}, with control parameter (here equal to the focus parameter) μ = E Y². FIC plot for selecting a in the HL estimation strategy:

[Figure: root fic(a) plotted against a.]

Note: sometimes a = 0 is best, i.e. choose the ML.
23/25 F: Choosing the control parameters

The general hybrid likelihood estimation method works via constructing h_n(θ) = (1 − a) ℓ_n(θ) + a log R_n(μ(θ)), which starts with choosing the control parameters μ_1, ..., μ_q. These aim at fitting models such that certain aspects are well calibrated, beyond those taken care of by the ML, which concentrates on the score functions u_1(y, θ), ..., u_p(y, θ). One can choose m(y, μ) = g(y) − μ to make sure that the HL incorporates aspects of μ = E g(Y_i).

Favourite case: for a given focus parameter ψ = ψ(f), use this as the single control parameter. For a given focus parameter ψ = ψ(f), one may also select among candidate μ_j controls via FIC schemes. One may stretch the idea further, including a slowly increasing sequence μ_1, μ_2, ..., with a FIC (or AFIC) stopping criterion.
24/25 G: Concluding remarks (and questions)

A. The methodology works for multidimensional data y_i, and can be extended to regression settings.

B. We fine-tune the balance parameter a by minimising the curve risk-hat(a) over [0, 1]. If the model gives a good fit, risk-hat(a) is minimal at a = 0, and we use the ML after all. This also yields an implied goodness-of-fit test.

C. So far: a large-sample approximation framework and methodology, with fixed p (the dimension of θ), q (the number of control parameters), and r (the number of extra γ_j model extension parameters). It is of interest to let these grow with n, but this is mathematically more difficult.
25/25 D. Instead of h_n(θ) = (1 − a) ℓ_n(θ) + a log R_n(μ(θ)), one may work with a certain cousin operation,

h̃_n(θ) = (1 − a) ℓ_n(θ) + a log R̃_n(μ(θ)), with log R̃_n(μ(θ)) = −(1/2) V_n(μ(θ))^t W_n(μ(θ))^{−1} V_n(μ(θ)),

i.e. using the quadratic approximation to the EL rather than the EL itself. This is (a) easier, numerically and mathematically; (b) equivalent, up to O(1/√n) neighbourhoods; (c) valid more generally. This gives (with some effort) a minimum divergence method based on

d(f, f_θ) = (1 − a) KL(f, f_θ) + a d_Q(f, f_θ), with d_Q(f, f_θ) = (1/2) (μ_θ − μ_0)^t Ω_0^{−1} (μ_θ − μ_0) / {1 + (μ_θ − μ_0)^t Ω_0^{−1} (μ_θ − μ_0)}.
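Reading the slide's last display as d_Q = (1/2) q/(1 + q) with q = (μ_θ − μ_0)^t Ω_0^{−1} (μ_θ − μ_0) (my reconstruction of the fraction), d_Q is a bounded transform of a Mahalanobis-type distance: zero when the control parameters agree, never exceeding 1/2. A small sketch with invented Ω_0 and μ values:

```python
import numpy as np

def d_Q(mu_theta, mu0, Omega0):
    """Bounded quadratic-type divergence 0.5 * q / (1 + q)."""
    diff = np.asarray(mu_theta, dtype=float) - np.asarray(mu0, dtype=float)
    q = float(diff @ np.linalg.inv(Omega0) @ diff)
    return 0.5 * q / (1.0 + q)

Omega0 = np.array([[1.0, 0.2], [0.2, 2.0]])
print(d_Q([0.0, 0.0], [0.0, 0.0], Omega0))   # 0.0 when the means agree
print(d_Q([5.0, -3.0], [0.0, 0.0], Omega0))  # close to the upper bound 0.5
```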
More informationCombining diverse information sources with the II-CC-FF paradigm
Combining diverse information sources with the II-CC-FF paradigm Céline Cunen (joint work with Nils Lid Hjort) University of Oslo 18/07/2017 1/25 The Problem - Combination of information We have independent
More informationTesting Restrictions and Comparing Models
Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by
More informationEmpirical likelihood-based methods for the difference of two trimmed means
Empirical likelihood-based methods for the difference of two trimmed means 24.09.2012. Latvijas Universitate Contents 1 Introduction 2 Trimmed mean 3 Empirical likelihood 4 Empirical likelihood for the
More informationLikelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science
1 Likelihood Ratio Tests that Certain Variance Components Are Zero Ciprian M. Crainiceanu Department of Statistical Science www.people.cornell.edu/pages/cmc59 Work done jointly with David Ruppert, School
More informationGMM - Generalized method of moments
GMM - Generalized method of moments GMM Intuition: Matching moments You want to estimate properties of a data set {x t } T t=1. You assume that x t has a constant mean and variance. x t (µ 0, σ 2 ) Consider
More informationBusiness Statistics. Lecture 10: Course Review
Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,
More informationStat 710: Mathematical Statistics Lecture 31
Stat 710: Mathematical Statistics Lecture 31 Jun Shao Department of Statistics University of Wisconsin Madison, WI 53706, USA Jun Shao (UW-Madison) Stat 710, Lecture 31 April 13, 2009 1 / 13 Lecture 31:
More informationCentral Limit Theorem ( 5.3)
Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately
More informationChapter 6. Estimation of Confidence Intervals for Nodal Maximum Power Consumption per Customer
Chapter 6 Estimation of Confidence Intervals for Nodal Maximum Power Consumption per Customer The aim of this chapter is to calculate confidence intervals for the maximum power consumption per customer
More informationChapter 9. Non-Parametric Density Function Estimation
9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least
More informationEstimation Tasks. Short Course on Image Quality. Matthew A. Kupinski. Introduction
Estimation Tasks Short Course on Image Quality Matthew A. Kupinski Introduction Section 13.3 in B&M Keep in mind the similarities between estimation and classification Image-quality is a statistical concept
More informationLikelihood based Statistical Inference. Dottorato in Economia e Finanza Dipartimento di Scienze Economiche Univ. di Verona
Likelihood based Statistical Inference Dottorato in Economia e Finanza Dipartimento di Scienze Economiche Univ. di Verona L. Pace, A. Salvan, N. Sartori Udine, April 2008 Likelihood: observed quantities,
More informationLikelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square
Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square Yuxin Chen Electrical Engineering, Princeton University Coauthors Pragya Sur Stanford Statistics Emmanuel
More informationA Very Brief Summary of Statistical Inference, and Examples
A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random
More informationOn robust and efficient estimation of the center of. Symmetry.
On robust and efficient estimation of the center of symmetry Howard D. Bondell Department of Statistics, North Carolina State University Raleigh, NC 27695-8203, U.S.A (email: bondell@stat.ncsu.edu) Abstract
More informationTime Series. Anthony Davison. c
Series Anthony Davison c 2008 http://stat.epfl.ch Periodogram 76 Motivation............................................................ 77 Lutenizing hormone data..................................................
More informationMath 533 Extra Hour Material
Math 533 Extra Hour Material A Justification for Regression The Justification for Regression It is well-known that if we want to predict a random quantity Y using some quantity m according to a mean-squared
More informationLeast Squares Model Averaging. Bruce E. Hansen University of Wisconsin. January 2006 Revised: August 2006
Least Squares Model Averaging Bruce E. Hansen University of Wisconsin January 2006 Revised: August 2006 Introduction This paper developes a model averaging estimator for linear regression. Model averaging
More informationDensity Estimation. We are concerned more here with the non-parametric case (see Roger Barlow s lectures for parametric statistics)
Density Estimation Density Estimation: Deals with the problem of estimating probability density functions (PDFs) based on some data sampled from the PDF. May use assumed forms of the distribution, parameterized
More informationStatistical Data Analysis Stat 3: p-values, parameter estimation
Statistical Data Analysis Stat 3: p-values, parameter estimation London Postgraduate Lectures on Particle Physics; University of London MSci course PH4515 Glen Cowan Physics Department Royal Holloway,
More informationO Combining cross-validation and plug-in methods - for kernel density bandwidth selection O
O Combining cross-validation and plug-in methods - for kernel density selection O Carlos Tenreiro CMUC and DMUC, University of Coimbra PhD Program UC UP February 18, 2011 1 Overview The nonparametric problem
More informationThe Restricted Likelihood Ratio Test at the Boundary in Autoregressive Series
The Restricted Likelihood Ratio Test at the Boundary in Autoregressive Series Willa W. Chen Rohit S. Deo July 6, 009 Abstract. The restricted likelihood ratio test, RLRT, for the autoregressive coefficient
More informationSTAB57: Quiz-1 Tutorial 1 (Show your work clearly) 1. random variable X has a continuous distribution for which the p.d.f.
STAB57: Quiz-1 Tutorial 1 1. random variable X has a continuous distribution for which the p.d.f. is as follows: { kx 2.5 0 < x < 1 f(x) = 0 otherwise where k > 0 is a constant. (a) (4 points) Determine
More information