Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach
Jae-Kwang Kim
Department of Statistics, Iowa State University
Outline
1. Introduction
2. Observed likelihood
3. Mean score approach
4. Observed information

Jae-Kwang Kim (ISU) Ch 2 P1 2 / 52
1. Introduction - Basic setup (no missing data)

$y = (y_1, y_2, \ldots, y_n)$ is a realization of a random sample from an infinite population with density $f(y)$. Assume that the true density $f(y)$ belongs to a parametric family of densities $\mathcal{P} = \{f(y;\theta) : \theta \in \Omega\}$ indexed by $\theta \in \Omega$. That is, there exists $\theta_0 \in \Omega$ such that $f(y;\theta_0) = f(y)$ for all $y$.
Likelihood

Definitions for likelihood theory:
1. The likelihood function of $\theta$ is defined as $L(\theta) = f(\mathbf{y};\theta)$, where $f(\mathbf{y};\theta)$ is the joint pdf of $\mathbf{y}$.
2. $\hat\theta$ is the maximum likelihood estimator (MLE) of $\theta_0$ if it satisfies $L(\hat\theta) = \max_{\theta \in \Omega} L(\theta)$.
3. A parametric family of densities $\mathcal{P} = \{f(y;\theta) : \theta \in \Omega\}$ is identifiable if $\theta_1 \neq \theta_2$ implies $f(\cdot;\theta_1) \neq f(\cdot;\theta_2)$; that is, the two densities differ for some $y$.
Lemma 2.1: Properties of identifiable distributions

Lemma 2.1. If $\mathcal{P} = \{f(y;\theta) : \theta \in \Omega\}$ is identifiable and $E|\ln f(y;\theta)| < \infty$ for all $\theta$, then
$$M(\theta) = E_{\theta_0}\left[\ln\frac{f(y;\theta)}{f(y;\theta_0)}\right]$$
has a unique maximum at $\theta_0$.

[Proof] Let $Z = f(y;\theta)/f(y;\theta_0)$. By the strict version of Jensen's inequality for the concave function $\ln$, $E[\ln(Z)] < \ln[E(Z)]$ unless $Z$ is degenerate. Because $E_{\theta_0}(Z) = \int f(y;\theta)\,dy = 1$, we have $\ln[E(Z)] = 0$, so $M(\theta) \le 0$ with equality only at $\theta = \theta_0$ (by identifiability).
Remark

1. In Lemma 2.1, $-M(\theta)$ is the Kullback-Leibler divergence of $f(y;\theta)$ from $f(y;\theta_0)$. It is often considered a measure of distance between the two distributions.
2. Lemma 2.1 simply says that, under identifiability, $Q(\theta) = E_{\theta_0}\{\log f(Y;\theta)\}$ takes its (unique) maximum value at $\theta = \theta_0$.
3. Define $Q_n(\theta) = n^{-1}\sum_{i=1}^n \log f(y_i;\theta)$; then the MLE $\hat\theta$ is the maximizer of $Q_n(\theta)$. Since $Q_n(\theta)$ converges in probability to $Q(\theta)$, can we say that the maximizer of $Q_n(\theta)$ converges to the maximizer of $Q(\theta)$?
Theorem 2.1: (Weak) consistency of the MLE

Theorem 2.1. Let
$$Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(y_i;\theta)$$
and let $\hat\theta$ be any solution of $Q_n(\hat\theta) = \max_{\theta\in\Omega} Q_n(\theta)$. Assume the following two conditions:
1. Uniform weak convergence: for some non-stochastic function $Q(\theta)$, $\sup_{\theta\in\Omega}|Q_n(\theta) - Q(\theta)| \xrightarrow{p} 0$.
2. Identification: $Q(\theta)$ is uniquely maximized at $\theta_0$.
Then $\hat\theta \xrightarrow{p} \theta_0$.
Remark

1. Convergence in probability: $\hat\theta \xrightarrow{p} \theta_0$ means $P\{\|\hat\theta - \theta_0\| > \epsilon\} \to 0$ as $n \to \infty$, for any $\epsilon > 0$.
2. If $\mathcal{P} = \{f(y;\theta) : \theta\in\Omega\}$ is not identifiable, then $Q(\theta)$ may not have a unique maximum and $\hat\theta$ may not converge (in probability) to a single point.
3. If the true distribution $f(y)$ does not belong to the class $\mathcal{P} = \{f(y;\theta) : \theta\in\Omega\}$, which point does $\hat\theta$ converge to?
Example: log-likelihood function for $N(\theta, 1)$

Under the assumption $y_i \sim N(\theta, 1)$, the log-likelihood is
$$l_n(\theta) = \sum_{i=1}^n \log f(y_i;\theta).$$
[Figure: plot of $l_n(\theta)$ against $\theta$.]
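The log-likelihood surface above is easy to reproduce numerically. A minimal Python sketch (the simulated sample, seed, and grid are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=200)  # sample from N(theta_0, 1), theta_0 = 2

def loglik(theta, y):
    """l_n(theta) = sum_i log f(y_i; theta) for the N(theta, 1) model."""
    return -0.5 * len(y) * np.log(2 * np.pi) - 0.5 * np.sum((y - theta) ** 2)

# Evaluate l_n(theta) on a grid; these values are what the slide plots.
grid = np.linspace(0, 4, 401)
values = np.array([loglik(t, y) for t in grid])
theta_hat = grid[np.argmax(values)]
print(theta_hat, y.mean())  # the grid maximizer is close to ybar
```

Because $l_n(\theta) = \text{const} - \tfrac{1}{2}\sum_i (y_i - \theta)^2$ here, the exact maximizer is the sample mean, which the grid search recovers up to the grid resolution.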
Other properties of the MLE

- Asymptotic normality (Theorem 2.2)
- Asymptotic optimality: the MLE achieves the Cramer-Rao lower bound
- Wilks' theorem: $2\{l_n(\hat\theta) - l_n(\theta_0)\} \xrightarrow{d} \chi^2_p$
1. Introduction - Fisher information

Definition
1. Score function: $S(\theta) = \partial \log L(\theta)/\partial\theta$
2. Fisher information (curvature of the log-likelihood): $I(\theta) = -\partial^2\log L(\theta)/\partial\theta\,\partial\theta^T = -\partial S(\theta)/\partial\theta^T$
3. Observed (Fisher) information: $I(\hat\theta_n)$, where $\hat\theta_n$ is the MLE.
4. Expected (Fisher) information: $\mathcal{I}(\theta) = E_\theta\{I(\theta)\}$

- The observed information is positive (semi-)definite at the MLE.
- The observed information applies to a single dataset, while the expected information is an average over all possible datasets and is meaningful as a function of $\theta$ across the admissible values of $\theta$.
- $I(\hat\theta) = \mathcal{I}(\hat\theta)$ for the exponential family.
Properties of the score function

Theorem 2.3. Under the regularity conditions allowing the exchange of the order of integration and differentiation,
$$E_\theta\{S(\theta)\} = 0 \quad\text{and}\quad V_\theta\{S(\theta)\} = \mathcal{I}(\theta).$$
Proof. Under the regularity conditions,
$$E_\theta\{S(\theta)\} = \int \frac{\partial \log f(y;\theta)}{\partial\theta} f(y;\theta)\,dy = \int \frac{\partial f(y;\theta)}{\partial\theta}\,dy = \frac{\partial}{\partial\theta}\int f(y;\theta)\,dy = 0.$$
Differentiating $\int S(\theta) f(y;\theta)\,dy = 0$ once more with respect to $\theta^T$ gives $E_\theta\{\partial S(\theta)/\partial\theta^T\} + E_\theta\{S(\theta)S(\theta)^T\} = 0$, so that
$$V_\theta\{S(\theta)\} = E_\theta\{S(\theta)S(\theta)^T\} = -E_\theta\{\partial S(\theta)/\partial\theta^T\} = \mathcal{I}(\theta).$$
1. Introduction - Fisher information

Remark. Since $\hat\theta \xrightarrow{p} \theta_0$, we can apply a Taylor linearization to $S(\hat\theta) = 0$ to get
$$\hat\theta - \theta_0 \approx \{\mathcal{I}(\theta_0)\}^{-1} S(\theta_0).$$
Here, we use the fact that $I(\theta) = -\partial S(\theta)/\partial\theta^T$ converges in probability to $\mathcal{I}(\theta)$. Thus, the (asymptotic) variance of the MLE is
$$V(\hat\theta) \doteq \{\mathcal{I}(\theta_0)\}^{-1} V\{S(\theta_0)\}\{\mathcal{I}(\theta_0)\}^{-1} = \{\mathcal{I}(\theta_0)\}^{-1},$$
where the last equality follows from Theorem 2.3.
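The claim $V(\hat\theta) \approx \{\mathcal{I}(\theta_0)\}^{-1}$ can be checked by Monte Carlo. A sketch for the exponential model $f(y;\theta) = \theta e^{-\theta y}$, where the MLE is $\hat\theta = 1/\bar y$ and $\mathcal{I}(\theta) = n/\theta^2$ (the true parameter, sample size, and number of replicates are arbitrary choices):

```python
import numpy as np

# Monte Carlo check that V(theta_hat) ~ {I(theta_0)}^{-1} for the
# exponential model: theta_hat = 1/ybar and I(theta) = n/theta^2.
rng = np.random.default_rng(1)
theta0, n, reps = 2.0, 500, 2000
mles = np.array([1.0 / rng.exponential(scale=1.0 / theta0, size=n).mean()
                 for _ in range(reps)])
mc_var = mles.var()               # Monte Carlo variance of the MLE
asym_var = theta0 ** 2 / n        # {I(theta_0)}^{-1}
print(mc_var, asym_var)           # the two should be close for large n
```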
Section 2: Observed likelihood
Basic setup

1. Let $y = (y_1, y_2, \ldots, y_n)$ be a realization of a random sample from an infinite population with density $f(y;\theta_0)$.
2. Let $\delta_i$ be an indicator function defined by $\delta_i = 1$ if $y_i$ is observed and $\delta_i = 0$ otherwise, and assume that $\Pr(\delta_i = 1 \mid y_i) = \pi(y_i, \phi)$ for some (unknown) parameter $\phi$, where $\pi(\cdot)$ is a known function.
3. Thus, instead of observing $(\delta_i, y_i)$ directly, we observe $(\delta_i, y_{i,\mathrm{obs}})$, where $y_{i,\mathrm{obs}} = y_i$ if $\delta_i = 1$ and $y_{i,\mathrm{obs}} = *$ (missing) if $\delta_i = 0$.
4. What is the (marginal) density of $(y_{i,\mathrm{obs}}, \delta_i)$?
Motivation (change-of-variable technique)

1. Suppose that $z$ is a random variable with density $f(z)$.
2. Instead of observing $z$ directly, we observe only $y = y(z)$, where the mapping $z \mapsto y(z)$ is known.
3. The density of $y$ is
$$g(y) = \int_{R(y)} f(z)\,dz,$$
where $R(y) = \{z : y(z) = y\}$.
Derivation

1. The original distribution of $z = (y, \delta)$ is $f(y,\delta) = f_1(y) f_2(\delta\mid y)$, where $f_1(y)$ is the density of $y$ and $f_2(\delta\mid y)$, the conditional density of $\delta$ given $y$, is
$$f_2(\delta\mid y) = \{\pi(y)\}^\delta\{1-\pi(y)\}^{1-\delta}, \qquad \pi(y) = \Pr(\delta = 1\mid y).$$
2. Instead of observing $z = (y,\delta)$ directly, we observe only $(y_{\mathrm{obs}},\delta)$, where $y_{\mathrm{obs}} = y_{\mathrm{obs}}(y,\delta)$ and the mapping $(y,\delta) \mapsto y_{\mathrm{obs}}$ is known.
3. The (marginal) density of $(y_{\mathrm{obs}},\delta)$ is
$$g(y_{\mathrm{obs}},\delta) = \int_{R(y_{\mathrm{obs}},\delta)} f(y,\delta)\,dy,$$
where $R(y_{\mathrm{obs}},\delta) = \{y : y_{\mathrm{obs}}(y,\delta) = y_{\mathrm{obs}}\}$.
Likelihood under missing data (= observed likelihood)

The observed likelihood is the likelihood obtained from the marginal density of $(y_{i,\mathrm{obs}},\delta_i)$, $i = 1, 2, \ldots, n$, and can be written, under the IID setup, as
$$L_{\mathrm{obs}}(\theta,\phi) = \prod_{\delta_i=1}\left[f_1(y_i;\theta) f_2(\delta_i\mid y_i;\phi)\right]\prod_{\delta_i=0}\left[\int f_1(y_i;\theta) f_2(\delta_i\mid y_i;\phi)\,dy_i\right] = \prod_{\delta_i=1}\left[f_1(y_i;\theta)\pi(y_i;\phi)\right]\prod_{\delta_i=0}\left[\int f_1(y;\theta)\{1-\pi(y;\phi)\}\,dy\right],$$
where $\pi(y;\phi) = P(\delta = 1\mid y;\phi)$.
Example 2.2 (Censored regression model, or Tobit model)

$$z_i = x_i'\beta + \epsilon_i, \qquad \epsilon_i \sim N(0,\sigma^2), \qquad y_i = \begin{cases} z_i & \text{if } z_i \ge 0 \\ 0 & \text{if } z_i < 0. \end{cases}$$
The observed log-likelihood is
$$l(\beta,\sigma^2) = -\frac{1}{2}\sum_{y_i>0}\left[\ln 2\pi + \ln\sigma^2 + \frac{(y_i - x_i'\beta)^2}{\sigma^2}\right] + \sum_{y_i=0}\ln\left[1-\Phi\left(\frac{x_i'\beta}{\sigma}\right)\right],$$
where $\Phi(x)$ is the cdf of the standard normal distribution.
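The Tobit log-likelihood translates directly into code. A sketch that maximizes it with `scipy.optimize.minimize` on simulated data (the design matrix, true parameters, and seed are assumptions for illustration); $\sigma$ is parameterized on the log scale to keep it positive:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, x, y):
    """Negative observed log-likelihood for the Tobit model of Example 2.2."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)                 # reparameterize so sigma > 0
    mu = x @ beta
    pos = y > 0
    ll_pos = norm.logpdf(y[pos], loc=mu[pos], scale=sigma).sum()
    ll_zero = norm.logcdf(-mu[~pos] / sigma).sum()  # log{1 - Phi(x'beta/sigma)}
    return -(ll_pos + ll_zero)

# Simulated data: intercept plus one covariate (illustrative design).
rng = np.random.default_rng(2)
n = 1000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([0.5, 1.0]), 1.0
z = x @ beta_true + rng.normal(scale=sigma_true, size=n)
y = np.where(z >= 0, z, 0.0)                  # censoring at zero

fit = minimize(tobit_negloglik, x0=np.zeros(3), args=(x, y), method="BFGS")
beta_hat, sigma_hat = fit.x[:2], np.exp(fit.x[2])
print(beta_hat, sigma_hat)
```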
Example 2.3

Let $t_1, t_2, \ldots, t_n$ be an IID sample from a distribution with density $f_\theta(t) = \theta e^{-\theta t} I(t > 0)$. Instead of observing $t_i$, we observe $(y_i,\delta_i)$, where
$$y_i = \begin{cases} t_i & \text{if } \delta_i = 1 \\ c & \text{if } \delta_i = 0 \end{cases} \qquad\text{and}\qquad \delta_i = \begin{cases} 1 & \text{if } t_i \le c \\ 0 & \text{if } t_i > c, \end{cases}$$
where $c$ is a known censoring time. The observed likelihood for $\theta$ can be derived as
$$L_{\mathrm{obs}}(\theta) = \prod_{i=1}^n \{f_\theta(t_i)\}^{\delta_i}\{P(t_i > c)\}^{1-\delta_i} = \theta^{\sum_{i=1}^n \delta_i}\exp\left(-\theta\sum_{i=1}^n y_i\right).$$
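Maximizing $L_{\mathrm{obs}}(\theta)$ gives the closed-form MLE $\hat\theta = \sum_i \delta_i / \sum_i y_i$: the number of observed failures divided by the total time at risk. A small simulation sketch (true $\theta$, $c$, and $n$ are illustrative):

```python
import numpy as np

# Censored exponential: t ~ Exp(theta), observed y = min(t, c), delta = 1{t <= c}.
# The observed likelihood gives the closed-form MLE theta_hat = sum(delta)/sum(y).
rng = np.random.default_rng(3)
theta0, c, n = 1.5, 1.0, 5000
t = rng.exponential(scale=1.0 / theta0, size=n)
delta = (t <= c).astype(int)
y = np.minimum(t, c)

theta_hat = delta.sum() / y.sum()
print(theta_hat)  # close to theta0 = 1.5
```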
Multivariate extension

Basic setup:
- Let $y = (y_1, \ldots, y_p)'$ be a $p$-dimensional random vector with probability density function $f(y;\theta)$.
- Let $\delta_{ij}$ be the response indicator of $y_{ij}$: $\delta_{ij} = 1$ if $y_{ij}$ is observed and $\delta_{ij} = 0$ otherwise.
- $\delta_i = (\delta_{i1}, \ldots, \delta_{ip})$: $p$-dimensional random vector with density $P(\delta\mid y)$, assuming $P(\delta\mid y) = P(\delta\mid y;\phi)$ for some $\phi$.
- Let $(y_{i,\mathrm{obs}}, y_{i,\mathrm{mis}})$ be the observed and missing parts of $y_i$, respectively.
- Let $R(y_{\mathrm{obs}},\delta) = \{y : y_{\mathrm{obs}}(y_i,\delta_i) = y_{i,\mathrm{obs}},\ i = 1, \ldots, n\}$ be the set of all possible values of $y$ with the same realized value of $y_{\mathrm{obs}}$ for given $\delta$, where $y_{\mathrm{obs}}(y_i,\delta_i)$ is the function that returns the components $y_{ij}$ with $\delta_{ij} = 1$.
Definition: observed likelihood

Under the above setup, the observed likelihood of $(\theta,\phi)$ is
$$L_{\mathrm{obs}}(\theta,\phi) = \int_{R(y_{\mathrm{obs}},\delta)} f(y;\theta) P(\delta\mid y;\phi)\,dy.$$
Under the IID setup, the observed likelihood is
$$L_{\mathrm{obs}}(\theta,\phi) = \prod_{i=1}^n\left[\int f(y_i;\theta) P(\delta_i\mid y_i;\phi)\,dy_{i,\mathrm{mis}}\right],$$
where it is understood that, if $y_i = y_{i,\mathrm{obs}}$ and $y_{i,\mathrm{mis}}$ is empty, there is nothing to integrate out. In the special case of scalar $y$, the observed likelihood is
$$L_{\mathrm{obs}}(\theta,\phi) = \prod_{\delta_i=1}[f(y_i;\theta)\pi(y_i;\phi)]\prod_{\delta_i=0}\left[\int f(y;\theta)\{1-\pi(y;\phi)\}\,dy\right],$$
where $\pi(y;\phi) = P(\delta = 1\mid y;\phi)$.
Missing At Random

Definition: Missing At Random (MAR). $P(\delta\mid y)$ is the density of the conditional distribution of $\delta$ given $y$. Let $y_{\mathrm{obs}} = y_{\mathrm{obs}}(y,\delta)$, where $y_{i,\mathrm{obs}} = y_i$ if $\delta_i = 1$ and $y_{i,\mathrm{obs}} = *$ if $\delta_i = 0$. The response mechanism is MAR if
$$P(\delta\mid y_1) = P(\delta\mid y_2) \quad\{\text{or } P(\delta\mid y) = P(\delta\mid y_{\mathrm{obs}})\}$$
for all $y_1$ and $y_2$ satisfying $y_{\mathrm{obs}}(y_1,\delta) = y_{\mathrm{obs}}(y_2,\delta)$.
- MAR: the response mechanism $P(\delta\mid y)$ depends on $y$ only through $y_{\mathrm{obs}}$.
- Let $y = (y_{\mathrm{obs}}, y_{\mathrm{mis}})$. By Bayes' theorem,
$$P(y_{\mathrm{mis}}\mid y_{\mathrm{obs}},\delta) = \frac{P(\delta\mid y_{\mathrm{mis}}, y_{\mathrm{obs}})}{P(\delta\mid y_{\mathrm{obs}})} P(y_{\mathrm{mis}}\mid y_{\mathrm{obs}}).$$
- MAR: $P(y_{\mathrm{mis}}\mid y_{\mathrm{obs}},\delta) = P(y_{\mathrm{mis}}\mid y_{\mathrm{obs}})$. That is, $y_{\mathrm{mis}} \perp \delta \mid y_{\mathrm{obs}}$.
- MAR: the conditional independence of $\delta$ and $y_{\mathrm{mis}}$ given $y_{\mathrm{obs}}$.
Remark

- MCAR (Missing Completely At Random): $P(\delta\mid y)$ does not depend on $y$.
- MAR (Missing At Random): $P(\delta\mid y) = P(\delta\mid y_{\mathrm{obs}})$
- NMAR (Not Missing At Random): $P(\delta\mid y) \neq P(\delta\mid y_{\mathrm{obs}})$
Thus, MCAR is a special case of MAR.
Likelihood factorization theorem

Theorem 2.4 (Rubin, 1976). Let $P_\phi(\delta\mid y)$ be the joint density of $\delta$ given $y$ and $f_\theta(y)$ the joint density of $y$. Under the conditions that
1. the parameters $\theta$ and $\phi$ are distinct, and
2. the MAR condition holds,
the observed likelihood can be written as $L_{\mathrm{obs}}(\theta,\phi) = L_1(\theta) L_2(\phi)$, and the MLE of $\theta$ can be obtained by maximizing $L_1(\theta)$. Thus, we do not have to specify a model for the response mechanism. The response mechanism is called ignorable if the above likelihood factorization holds.
Example 2.4

- Bivariate data $(x_i, y_i)$ with pdf $f(x,y) = f_1(y\mid x) f_2(x)$; $x_i$ is always observed and $y_i$ is subject to missingness.
- Assume that the response status variable $\delta_i$ of $y_i$ satisfies
$$P(\delta_i = 1\mid x_i, y_i) = \Lambda_1(\phi_0 + \phi_1 x_i + \phi_2 y_i)$$
for some function $\Lambda_1(\cdot)$ of known form.
- Let $\theta$ be the parameter of interest in the regression model $f_1(y\mid x;\theta)$, and let $\alpha$ be the parameter in the marginal distribution of $x$, denoted by $f_2(x_i;\alpha)$. Define $\Lambda_0(x) = 1 - \Lambda_1(x)$.
- Three parameters: $\theta$, the parameter of interest; $\alpha$ and $\phi$, nuisance parameters.
Example 2.4 (cont'd)

Observed likelihood:
$$L_{\mathrm{obs}}(\theta,\alpha,\phi) = \prod_{\delta_i=1} f_1(y_i\mid x_i;\theta) f_2(x_i;\alpha)\Lambda_1(\phi_0+\phi_1 x_i+\phi_2 y_i) \times \prod_{\delta_i=0}\int f_1(y\mid x_i;\theta) f_2(x_i;\alpha)\Lambda_0(\phi_0+\phi_1 x_i+\phi_2 y)\,dy = L_1(\theta,\phi) L_2(\alpha),$$
where $L_2(\alpha) = \prod_{i=1}^n f_2(x_i;\alpha)$. Thus, we can safely ignore the marginal distribution of $x$ if $x$ is completely observed.
Example 2.4 (cont'd)

If $\phi_2 = 0$, then MAR holds and $L_1(\theta,\phi) = L_{1a}(\theta) L_{1b}(\phi)$, where
$$L_{1a}(\theta) = \prod_{\delta_i=1} f_1(y_i\mid x_i;\theta)$$
and
$$L_{1b}(\phi) = \prod_{\delta_i=1}\Lambda_1(\phi_0+\phi_1 x_i)\prod_{\delta_i=0}\Lambda_0(\phi_0+\phi_1 x_i).$$
Thus, under MAR, the MLE of $\theta$ can be obtained by maximizing $L_{1a}(\theta)$, which is obtained by ignoring the missing part of the data.
Example 2.4 (cont'd)

If $x_i$ rather than $y_i$ is subject to missingness, the observed likelihood becomes
$$L_{\mathrm{obs}}(\theta,\phi,\alpha) = \prod_{\delta_i=1} f_1(y_i\mid x_i;\theta) f_2(x_i;\alpha)\Lambda_1(\phi_0+\phi_1 x_i+\phi_2 y_i)\times\prod_{\delta_i=0}\int f_1(y_i\mid x;\theta) f_2(x;\alpha)\Lambda_0(\phi_0+\phi_1 x+\phi_2 y_i)\,dx \neq L_1(\theta,\phi) L_2(\alpha).$$
If $\phi_1 = 0$, then $L_{\mathrm{obs}}(\theta,\alpha,\phi) = L_1(\theta,\alpha) L_2(\phi)$ and MAR holds. Although we are not interested in the marginal distribution of $x$, we still need to specify a model for the marginal distribution of $x$.
3. Mean Score Approach
3. Mean score approach

The observed likelihood is the marginal density of $(y_{\mathrm{obs}},\delta)$:
$$L_{\mathrm{obs}}(\eta) = \int_{R(y_{\mathrm{obs}},\delta)} f(y;\theta) P(\delta\mid y;\phi)\,d\mu(y) = \int f(y;\theta) P(\delta\mid y;\phi)\,d\mu(y_{\mathrm{mis}}),$$
where $y_{\mathrm{mis}}$ is the missing part of $y$ and $\eta = (\theta,\phi)$.
Observed score equation: $S_{\mathrm{obs}}(\eta) \equiv \partial\log L_{\mathrm{obs}}(\eta)/\partial\eta = 0$.
Computing the observed score function can be computationally challenging because the observed likelihood is in integral form.
3. Mean score approach

Theorem 2.5: Mean score theorem (Fisher, 1922). Under some regularity conditions, the observed score function equals the mean score function. That is, $S_{\mathrm{obs}}(\eta) = \bar S(\eta)$, where
$$\bar S(\eta) = E\{S_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\}, \qquad S_{\mathrm{com}}(\eta) = \frac{\partial}{\partial\eta}\log f(y,\delta;\eta), \qquad f(y,\delta;\eta) = f(y;\theta) P(\delta\mid y;\phi).$$
The mean score function is computed by taking the conditional expectation of the complete-sample score function given the observed data, and it is easier to compute than the observed score function.
3. Mean score approach

Proof of Theorem 2.5. Since $L_{\mathrm{obs}}(\eta) = f(y,\delta;\eta)/f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta)$, we have
$$\frac{\partial}{\partial\eta}\ln L_{\mathrm{obs}}(\eta) = \frac{\partial}{\partial\eta}\ln f(y,\delta;\eta) - \frac{\partial}{\partial\eta}\ln f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta).$$
Taking the conditional expectation of the above equation over the conditional distribution of $(y,\delta)$ given $(y_{\mathrm{obs}},\delta)$, we have
$$\frac{\partial}{\partial\eta}\ln L_{\mathrm{obs}}(\eta) = E\left\{\frac{\partial}{\partial\eta}\ln L_{\mathrm{obs}}(\eta)\,\middle|\,y_{\mathrm{obs}},\delta\right\} = E\{S_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\} - E\left\{\frac{\partial}{\partial\eta}\ln f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta)\,\middle|\,y_{\mathrm{obs}},\delta\right\}.$$
Here, the first equality holds because $L_{\mathrm{obs}}(\eta)$ is a function of $(y_{\mathrm{obs}},\delta)$ only. The last term is equal to zero by Theorem 2.3, which states that the expected value of a score function is zero; the reference distribution here is the conditional distribution of $(y,\delta)$ given $(y_{\mathrm{obs}},\delta)$.
Example 2.6

1. Suppose that the study variable $y$ follows a normal distribution with mean $x'\beta$ and variance $\sigma^2$. The score equations for $\beta$ and $\sigma^2$ under complete response are
$$S_1(\beta,\sigma^2) = \sum_{i=1}^n (y_i - x_i'\beta) x_i/\sigma^2 = 0$$
and
$$S_2(\beta,\sigma^2) = -\frac{n}{2\sigma^2} + \sum_{i=1}^n \frac{(y_i - x_i'\beta)^2}{2\sigma^4} = 0.$$
2. Assume that the $y_i$ are observed only for the first $r$ elements and the MAR assumption holds. In this case, the mean score functions reduce to
$$\bar S_1(\beta,\sigma^2) = \sum_{i=1}^r (y_i - x_i'\beta) x_i/\sigma^2$$
and
$$\bar S_2(\beta,\sigma^2) = -\frac{n}{2\sigma^2} + \sum_{i=1}^r \frac{(y_i - x_i'\beta)^2}{2\sigma^4} + \frac{n-r}{2\sigma^2}.$$
Example 2.6 (cont'd)

3. The maximum likelihood estimator obtained by solving the mean score equations is
$$\hat\beta = \left(\sum_{i=1}^r x_i x_i'\right)^{-1}\sum_{i=1}^r x_i y_i \qquad\text{and}\qquad \hat\sigma^2 = \frac{1}{r}\sum_{i=1}^r (y_i - x_i'\hat\beta)^2.$$
Thus, the resulting estimators can also be obtained by simply ignoring the missing part of the sample, which is consistent with the result in Example 2.4 (for $\phi_2 = 0$).
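The result in step 3 is easy to verify: solving the mean score equations is ordinary least squares on the $r$ respondents. A minimal sketch on simulated MAR data (design, parameters, and seed are illustrative assumptions):

```python
import numpy as np

# Under MAR with y observed for the first r units only, the mean score
# solution is least squares on the respondents (Example 2.6, step 3).
rng = np.random.default_rng(4)
n, r = 500, 300
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
y = x @ beta_true + rng.normal(size=n)

xr, yr = x[:r], y[:r]                              # respondents only
beta_hat = np.linalg.solve(xr.T @ xr, xr.T @ yr)
sigma2_hat = np.mean((yr - xr @ beta_hat) ** 2)    # divisor r: the MLE

# Check: beta_hat solves S1_bar = sum_{i<=r} (y_i - x_i'beta) x_i / sigma^2 = 0.
score1 = xr.T @ (yr - xr @ beta_hat)
print(beta_hat, sigma2_hat, score1)
```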
Discussion of Example 2.6

We are interested in estimating $\theta$ for the conditional density $f(y\mid x;\theta)$. Under MAR, the observed likelihood for $\theta$ is
$$L_{\mathrm{obs}}(\theta) = \prod_{i=1}^r f(y_i\mid x_i;\theta)\prod_{i=r+1}^n\int f(y\mid x_i;\theta)\,dy = \prod_{i=1}^r f(y_i\mid x_i;\theta).$$
The same conclusion follows from the mean score theorem. Under MAR, the mean score function is
$$\bar S(\theta) = \sum_{i=1}^r S(\theta; x_i, y_i) + \sum_{i=r+1}^n E\{S(\theta; x_i, Y)\mid x_i\} = \sum_{i=1}^r S(\theta; x_i, y_i),$$
where $S(\theta; x, y)$ is the score function for $\theta$ and the second equality follows from Theorem 2.3.
Example 2.5

1. Suppose that the study variable $y$ has a Bernoulli distribution with probability of success
$$p_i = p_i(\beta) = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}$$
for some unknown parameter $\beta$, where $x_i$ is a vector of covariates in the logistic regression model for $y_i$. We assume that 1 is in the column space of $x_i$.
2. Under complete response, the score function for $\beta$ is
$$S_1(\beta) = \sum_{i=1}^n (y_i - p_i(\beta)) x_i.$$
Example 2.5 (cont'd)

3. Let $\delta_i$ be the response indicator of $y_i$ with distribution Bernoulli$(\pi_i)$, where
$$\pi_i = \frac{\exp(x_i'\phi_0 + y_i\phi_1)}{1 + \exp(x_i'\phi_0 + y_i\phi_1)}.$$
We assume that $x_i$ is always observed, but $y_i$ is missing if $\delta_i = 0$.
4. Under missing data, the mean score function for $\beta$ is
$$\bar S_1(\beta,\phi) = \sum_{\delta_i=1}\{y_i - p_i(\beta)\} x_i + \sum_{\delta_i=0}\sum_{y=0}^1 w_i(y;\beta,\phi)\{y - p_i(\beta)\} x_i,$$
where $w_i(y;\beta,\phi)$ is the conditional probability of $y_i = y$ given $x_i$ and $\delta_i = 0$:
$$w_i(y;\beta,\phi) = \frac{P_\beta(y_i = y\mid x_i)\, P_\phi(\delta_i = 0\mid y_i = y, x_i)}{\sum_{z=0}^1 P_\beta(y_i = z\mid x_i)\, P_\phi(\delta_i = 0\mid y_i = z, x_i)}.$$
Thus, $\bar S_1(\beta,\phi)$ is also a function of $\phi$.
Example 2.5 (cont'd)

5. If the response mechanism is MAR so that $\phi_1 = 0$, then
$$w_i(y;\beta,\phi) = \frac{P_\beta(y_i = y\mid x_i)}{\sum_{z=0}^1 P_\beta(y_i = z\mid x_i)} = P_\beta(y_i = y\mid x_i),$$
and so $\bar S_1(\beta,\phi) = \sum_{\delta_i=1}\{y_i - p_i(\beta)\} x_i \equiv \bar S_1(\beta)$.
6. If MAR does not hold, then $(\hat\beta,\hat\phi)$ can be obtained by solving $\bar S_1(\beta,\phi) = 0$ and $\bar S_2(\beta,\phi) = 0$ jointly, where
$$\bar S_2(\beta,\phi) = \sum_{\delta_i=1}\{\delta_i - \pi(\phi; x_i, y_i)\}(x_i', y_i)' + \sum_{\delta_i=0}\sum_{y=0}^1 w_i(y;\beta,\phi)\{\delta_i - \pi(\phi; x_i, y)\}(x_i', y)'.$$
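The weights $w_i(y;\beta,\phi)$ in the mean score function can be computed directly from the two logistic models. A sketch for a single unit with $\delta_i = 0$ (all parameter values are hypothetical illustrations, not from the slides):

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def w(y, xi, beta, phi0, phi1):
    """w_i(y) = P(y_i = y | x_i, delta_i = 0) under the logistic outcome
    model p_i(beta) and the logistic response model pi_i(phi)."""
    num = np.array([
        (expit(xi @ beta) if z == 1 else 1 - expit(xi @ beta))  # P(y_i = z | x_i)
        * (1 - expit(xi @ phi0 + z * phi1))                     # P(delta_i = 0 | y_i = z, x_i)
        for z in (0, 1)
    ])
    return num[y] / num.sum()

# Illustrative (hypothetical) values: intercept plus one covariate.
xi = np.array([1.0, 0.5])
beta = np.array([0.2, 1.0])
phi0, phi1 = np.array([0.3, -0.5]), 1.2

probs = np.array([w(0, xi, beta, phi0, phi1), w(1, xi, beta, phi0, phi1)])
print(probs, probs.sum())  # the two weights sum to 1
```

When $\phi_1 = 0$ (MAR), the response factor is free of $z$ and cancels, so $w_i(y)$ reduces to $P_\beta(y_i = y \mid x_i)$, exactly as in step 5.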
Discussion of Example 2.5

- We may not have a unique solution to $\bar S(\eta) = 0$, where $\bar S(\eta) = [\bar S_1(\beta,\phi)', \bar S_2(\beta,\phi)']'$, when MAR does not hold, because of the non-identifiability problem associated with non-ignorable missingness.
- To avoid this problem, a reduced model is often used for the response mechanism: $\Pr(\delta = 1\mid x, y) = \Pr(\delta = 1\mid u, y)$, where $x = (u, z)$. The reduced response model introduces a smaller set of parameters, so the identifiability problem can be resolved. (More discussion follows in Chapter 6.)
- Computing the solution to $\bar S(\eta) = 0$ is also difficult. The EM algorithm, which will be discussed in Chapter 3, is a useful computational tool.
4. Observed Information
4. Observed information

Definition
1. Observed score function: $S_{\mathrm{obs}}(\eta) = \partial\log L_{\mathrm{obs}}(\eta)/\partial\eta$
2. Fisher information from the observed likelihood: $I_{\mathrm{obs}}(\eta) = -\partial^2\log L_{\mathrm{obs}}(\eta)/\partial\eta\,\partial\eta^T$
3. Expected (Fisher) information from the observed likelihood: $\mathcal{I}_{\mathrm{obs}}(\eta) = E_\eta\{I_{\mathrm{obs}}(\eta)\}$

Theorem 2.6. Under regularity conditions, $E\{S_{\mathrm{obs}}(\eta)\} = 0$ and $V\{S_{\mathrm{obs}}(\eta)\} = \mathcal{I}_{\mathrm{obs}}(\eta)$, where $\mathcal{I}_{\mathrm{obs}}(\eta) = E_\eta\{I_{\mathrm{obs}}(\eta)\}$ is the expected information from the observed likelihood.
4. Observed information

- Under missing data, the MLE $\hat\eta$ is the solution to $\bar S(\eta) = 0$.
- Under some regularity conditions, $\hat\eta$ converges in probability to $\eta_0$ and has asymptotic variance $\{\mathcal{I}_{\mathrm{obs}}(\eta_0)\}^{-1}$, with
$$\mathcal{I}_{\mathrm{obs}}(\eta) = E\left\{-\frac{\partial}{\partial\eta^T} S_{\mathrm{obs}}(\eta)\right\} = E\{S_{\mathrm{obs}}^{\otimes 2}(\eta)\} = E\{\bar S^{\otimes 2}(\eta)\}, \qquad B^{\otimes 2} \equiv BB^T.$$
- For variance estimation of $\hat\eta$, one may use $\{\mathcal{I}_{\mathrm{obs}}(\hat\eta)\}^{-1}$. In general, $I_{\mathrm{obs}}(\hat\eta)$ is preferred to $\mathcal{I}_{\mathrm{obs}}(\hat\eta)$ for variance estimation of $\hat\eta$.
Return to Example 2.3

- Observed log-likelihood: $\ln L_{\mathrm{obs}}(\theta) = \left(\sum_{i=1}^n \delta_i\right)\log\theta - \theta\sum_{i=1}^n y_i$
- MLE for $\theta$: $\hat\theta = \sum_{i=1}^n \delta_i \big/ \sum_{i=1}^n y_i$
- Fisher information: $I_{\mathrm{obs}}(\theta) = \sum_{i=1}^n \delta_i/\theta^2$
- Expected information: $\mathcal{I}_{\mathrm{obs}}(\theta) = \sum_{i=1}^n (1 - e^{-\theta c})/\theta^2 = n(1 - e^{-\theta c})/\theta^2$
- Which one do you prefer?
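The two resulting variance estimators can be compared in a small simulation: $\{I_{\mathrm{obs}}(\hat\theta)\}^{-1}$ uses the realized number of uncensored observations, while $\{\mathcal{I}_{\mathrm{obs}}(\hat\theta)\}^{-1}$ uses its expectation. A sketch (parameters are illustrative):

```python
import numpy as np

# Compare the two variance estimators for theta_hat in Example 2.3:
#   1/I_obs(theta_hat)  = theta_hat^2 / sum(delta)                  (observed info)
#   1/Iexp(theta_hat)   = theta_hat^2 / {n(1 - exp(-theta_hat*c))}  (expected info)
rng = np.random.default_rng(5)
theta0, c, n = 1.5, 1.0, 2000
t = rng.exponential(scale=1.0 / theta0, size=n)
delta = (t <= c).astype(float)
y = np.minimum(t, c)
theta_hat = delta.sum() / y.sum()

var_obs = theta_hat ** 2 / delta.sum()
var_exp = theta_hat ** 2 / (n * (1 - np.exp(-theta_hat * c)))
print(var_obs, var_exp)  # nearly equal: sum(delta) estimates n(1 - e^{-theta c})
```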
4. Observed information

Motivation
- $L_{\mathrm{com}}(\eta) = f(y,\delta;\eta)$: complete-sample likelihood with no missing data.
- Fisher information associated with $L_{\mathrm{com}}(\eta)$: $I_{\mathrm{com}}(\eta) = -\partial S_{\mathrm{com}}(\eta)/\partial\eta^T = -\partial^2\log L_{\mathrm{com}}(\eta)/\partial\eta\,\partial\eta^T$
- $L_{\mathrm{obs}}(\eta)$: the observed likelihood.
- Fisher information associated with $L_{\mathrm{obs}}(\eta)$: $I_{\mathrm{obs}}(\eta) = -\partial S_{\mathrm{obs}}(\eta)/\partial\eta^T = -\partial^2\log L_{\mathrm{obs}}(\eta)/\partial\eta\,\partial\eta^T$
- How can we express $I_{\mathrm{obs}}(\eta)$ in terms of $I_{\mathrm{com}}(\eta)$ and $S_{\mathrm{com}}(\eta)$?
Louis' formula

Theorem 2.7 (Louis, 1982; Oakes, 1999). Under regularity conditions allowing the exchange of the order of integration and differentiation,
$$I_{\mathrm{obs}}(\eta) = E\{I_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\} - E\{S_{\mathrm{com}}^{\otimes 2}(\eta)\mid y_{\mathrm{obs}},\delta\} + \bar S^{\otimes 2}(\eta) = E\{I_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\} - V\{S_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\},$$
where $\bar S(\eta) = E\{S_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\}$.
Proof of Theorem 2.7

By Theorem 2.5, the observed information associated with $L_{\mathrm{obs}}(\eta)$ can be expressed as $I_{\mathrm{obs}}(\eta) = -\partial\bar S(\eta)/\partial\eta^T$, where $\bar S(\eta) = E\{S_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta;\eta\}$. Thus, we have
$$\frac{\partial}{\partial\eta^T}\bar S(\eta) = \frac{\partial}{\partial\eta^T}\int S_{\mathrm{com}}(\eta; y) f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta)\,d\mu(y) = \int\left\{\frac{\partial}{\partial\eta^T} S_{\mathrm{com}}(\eta; y)\right\} f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta)\,d\mu(y) + \int S_{\mathrm{com}}(\eta; y)\left\{\frac{\partial}{\partial\eta^T}\log f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta)\right\} f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta)\,d\mu(y).$$
The first term equals $-E\{I_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\}$. Writing $S_{\mathrm{mis}}(\eta) = \partial\log f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta)/\partial\eta$, so that $S_{\mathrm{com}}(\eta) = \bar S(\eta) + S_{\mathrm{mis}}(\eta)$, the second term equals
$$E\{S_{\mathrm{com}}(\eta) S_{\mathrm{mis}}(\eta)^T\mid y_{\mathrm{obs}},\delta\} = E[\{\bar S(\eta) + S_{\mathrm{mis}}(\eta)\} S_{\mathrm{mis}}(\eta)^T\mid y_{\mathrm{obs}},\delta] = E\{S_{\mathrm{mis}}(\eta) S_{\mathrm{mis}}(\eta)^T\mid y_{\mathrm{obs}},\delta\} = V\{S_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\},$$
because $E\{\bar S(\eta) S_{\mathrm{mis}}(\eta)^T\mid y_{\mathrm{obs}},\delta\} = \bar S(\eta)\, E\{S_{\mathrm{mis}}(\eta)\mid y_{\mathrm{obs}},\delta\}^T = 0$.
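Louis' formula can be checked numerically for a single censored unit in Example 2.3. Given $t > c$, memorylessness gives $t \mid t > c \sim c + \mathrm{Exp}(\theta)$, so $V\{S_{\mathrm{com}} \mid t > c\} = 1/\theta^2 = E\{I_{\mathrm{com}} \mid t > c\}$ and a censored unit contributes zero to $I_{\mathrm{obs}}$, consistent with $I_{\mathrm{obs}}(\theta) = \sum_i \delta_i/\theta^2$. A Monte Carlo sketch (the parameter values are illustrative):

```python
import numpy as np

# Louis' formula for one censored exponential unit:
#   I_obs = E{I_com | obs} - V{S_com | obs}.
# With S_com = 1/theta - t and I_com = 1/theta^2, and t | t > c ~ c + Exp(theta),
# the censored unit's contribution should be (approximately) zero.
rng = np.random.default_rng(6)
theta, c = 1.5, 1.0
t_cond = c + rng.exponential(scale=1.0 / theta, size=200_000)  # draws of t | t > c
s_com = 1.0 / theta - t_cond
i_obs_unit = 1.0 / theta ** 2 - s_com.var()   # E{I_com|.} - V{S_com|.}
print(i_obs_unit)  # approximately 0
```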
Missing information principle

- $S_{\mathrm{mis}}(\eta)$: the score function associated with the conditional density $f(y,\delta\mid y_{\mathrm{obs}},\delta)$.
- Expected missing information: $\mathcal{I}_{\mathrm{mis}}(\eta) = E\{-\partial S_{\mathrm{mis}}(\eta)/\partial\eta^T\}$, satisfying $\mathcal{I}_{\mathrm{mis}}(\eta) = E\{S_{\mathrm{mis}}^{\otimes 2}(\eta)\}$.
- Missing information principle (Orchard and Woodbury, 1972): $\mathcal{I}_{\mathrm{mis}}(\eta) = \mathcal{I}_{\mathrm{com}}(\eta) - \mathcal{I}_{\mathrm{obs}}(\eta)$, where $\mathcal{I}_{\mathrm{com}}(\eta) = E\{-\partial S_{\mathrm{com}}(\eta)/\partial\eta^T\}$ is the expected information from the complete-sample likelihood.
- An alternative expression of the missing information principle is $V\{S_{\mathrm{mis}}(\eta)\} = V\{S_{\mathrm{com}}(\eta)\} - V\{\bar S(\eta)\}$. Note that $V\{S_{\mathrm{com}}(\eta)\} = \mathcal{I}_{\mathrm{com}}(\eta)$ and $V\{S_{\mathrm{obs}}(\eta)\} = \mathcal{I}_{\mathrm{obs}}(\eta)$, with $S_{\mathrm{obs}}(\eta) = \bar S(\eta)$.
Example 2.7

1. Consider the bivariate normal distribution
$$\begin{pmatrix} y_{1i} \\ y_{2i} \end{pmatrix} \sim N\left[\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}\right], \qquad i = 1, 2, \ldots, n.$$
Assume for simplicity that $\sigma_{11}, \sigma_{12}, \sigma_{22}$ are known constants and let $\mu = (\mu_1, \mu_2)'$ be the parameter of interest.
2. The complete-sample score function for $\mu$ is
$$S_{\mathrm{com}}(\mu) = \sum_{i=1}^n S_{\mathrm{com}}^{(i)}(\mu) = \sum_{i=1}^n \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} y_{1i} - \mu_1 \\ y_{2i} - \mu_2 \end{pmatrix}.$$
The information matrix of $\mu$ based on the complete sample is
$$\mathcal{I}_{\mathrm{com}}(\mu) = n\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}^{-1}.$$
Example 2.7 (cont'd)

3. Suppose that there are some missing values in $y_{1i}$ and $y_{2i}$, and the original sample is partitioned into four sets:
H = both $y_1$ and $y_2$ respond; K = only $y_1$ is observed; L = only $y_2$ is observed; M = both $y_1$ and $y_2$ are missing.
Let $n_H, n_K, n_L, n_M$ denote the sizes of H, K, L, M, respectively.
4. Assume that the response mechanism does not depend on the value of $(y_1, y_2)$, so it is MAR. In this case, the mean score function of $\mu$ based on a single observation in set K is
$$E\{S_{\mathrm{com}}^{(i)}(\mu)\mid y_{1i}, i\in K\} = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}^{-1}\begin{pmatrix} y_{1i} - \mu_1 \\ E(y_{2i}\mid y_{1i}) - \mu_2 \end{pmatrix} = \begin{pmatrix} \sigma_{11}^{-1}(y_{1i} - \mu_1) \\ 0 \end{pmatrix}.$$
Example 2.7 (cont'd)

5. Similarly, we have
$$E\{S_{\mathrm{com}}^{(i)}(\mu)\mid y_{2i}, i\in L\} = \begin{pmatrix} 0 \\ \sigma_{22}^{-1}(y_{2i} - \mu_2) \end{pmatrix}.$$
6. Therefore, the observed information matrix of $\mu$ is
$$\mathcal{I}_{\mathrm{obs}}(\mu) = n_H\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}^{-1} + n_K\begin{pmatrix} \sigma_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix} + n_L\begin{pmatrix} 0 & 0 \\ 0 & \sigma_{22}^{-1} \end{pmatrix},$$
and the asymptotic variance of the MLE of $\mu$ can be obtained as the inverse of $\mathcal{I}_{\mathrm{obs}}(\mu)$.
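The formula for $\mathcal{I}_{\mathrm{obs}}(\mu)$ in step 6 is straightforward to assemble numerically. A sketch with assumed (illustrative) covariance entries and partition sizes:

```python
import numpy as np

# Observed information for mu in Example 2.7, with known covariance.
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
n_H, n_K, n_L, n_M = 50, 30, 20, 10   # illustrative partition sizes

sigma_inv = np.linalg.inv(sigma)
I_obs = (n_H * sigma_inv
         + n_K * np.array([[1.0 / sigma[0, 0], 0.0], [0.0, 0.0]])
         + n_L * np.array([[0.0, 0.0], [0.0, 1.0 / sigma[1, 1]]]))
# Units in set M contribute nothing; the asymptotic variance of mu_hat is:
avar = np.linalg.inv(I_obs)
print(I_obs)
print(avar)
```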
REFERENCES

Fisher, R. A. (1922), On the mathematical foundations of theoretical statistics, Philosophical Transactions of the Royal Society of London A 222.
Louis, T. A. (1982), Finding the observed information matrix when using the EM algorithm, Journal of the Royal Statistical Society: Series B 44.
Oakes, D. (1999), Direct calculation of the information matrix via the EM algorithm, Journal of the Royal Statistical Society: Series B 61.
Orchard, T. and Woodbury, M. A. (1972), A missing information principle: theory and applications, in Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, Berkeley, California.
Rubin, D. B. (1976), Inference and missing data, Biometrika 63.
More informationLecture 1: Introduction
Principles of Statistics Part II - Michaelmas 208 Lecturer: Quentin Berthet Lecture : Introduction This course is concerned with presenting some of the mathematical principles of statistical theory. One
More informationOptimization Methods II. EM algorithms.
Aula 7. Optimization Methods II. 0 Optimization Methods II. EM algorithms. Anatoli Iambartsev IME-USP Aula 7. Optimization Methods II. 1 [RC] Missing-data models. Demarginalization. The term EM algorithms
More informationA Very Brief Summary of Statistical Inference, and Examples
A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)
More informationReview and continuation from last week Properties of MLEs
Review and continuation from last week Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that
More informationLikelihood inference in the presence of nuisance parameters
Likelihood inference in the presence of nuisance parameters Nancy Reid, University of Toronto www.utstat.utoronto.ca/reid/research 1. Notation, Fisher information, orthogonal parameters 2. Likelihood inference
More informationBrief Review on Estimation Theory
Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on
More informationOptimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.
Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may
More informationEstimation of Dynamic Regression Models
University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.
More informationLast lecture 1/35. General optimization problems Newton Raphson Fisher scoring Quasi Newton
EM Algorithm Last lecture 1/35 General optimization problems Newton Raphson Fisher scoring Quasi Newton Nonlinear regression models Gauss-Newton Generalized linear models Iteratively reweighted least squares
More informationStatistical Inference
Statistical Inference Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA. Asymptotic Inference in Exponential Families Let X j be a sequence of independent,
More informationMax. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes
Maximum Likelihood Estimation Econometrics II Department of Economics Universidad Carlos III de Madrid Máster Universitario en Desarrollo y Crecimiento Económico Outline 1 3 4 General Approaches to Parameter
More informationBiostat 2065 Analysis of Incomplete Data
Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh October 20, 2005 1. Large-sample inference based on ML Let θ is the MLE, then the large-sample theory implies
More informationSystem Identification, Lecture 4
System Identification, Lecture 4 Kristiaan Pelckmans (IT/UU, 2338) Course code: 1RT880, Report code: 61800 - Spring 2016 F, FRI Uppsala University, Information Technology 13 April 2016 SI-2016 K. Pelckmans
More informationMathematics Ph.D. Qualifying Examination Stat Probability, January 2018
Mathematics Ph.D. Qualifying Examination Stat 52800 Probability, January 2018 NOTE: Answers all questions completely. Justify every step. Time allowed: 3 hours. 1. Let X 1,..., X n be a random sample from
More informationP n. This is called the law of large numbers but it comes in two forms: Strong and Weak.
Large Sample Theory Large Sample Theory is a name given to the search for approximations to the behaviour of statistical procedures which are derived by computing limits as the sample size, n, tends to
More informationStatistical Methods. Missing Data snijders/sm.htm. Tom A.B. Snijders. November, University of Oxford 1 / 23
1 / 23 Statistical Methods Missing Data http://www.stats.ox.ac.uk/ snijders/sm.htm Tom A.B. Snijders University of Oxford November, 2011 2 / 23 Literature: Joseph L. Schafer and John W. Graham, Missing
More informationGaussian Processes 1. Schedule
1 Schedule 17 Jan: Gaussian processes (Jo Eidsvik) 24 Jan: Hands-on project on Gaussian processes (Team effort, work in groups) 31 Jan: Latent Gaussian models and INLA (Jo Eidsvik) 7 Feb: Hands-on project
More informationMaster s Written Examination
Master s Written Examination Option: Statistics and Probability Spring 05 Full points may be obtained for correct answers to eight questions Each numbered question (which may have several parts) is worth
More informationProblem Selected Scores
Statistics Ph.D. Qualifying Exam: Part II November 20, 2010 Student Name: 1. Answer 8 out of 12 problems. Mark the problems you selected in the following table. Problem 1 2 3 4 5 6 7 8 9 10 11 12 Selected
More informationAnalysis of Gamma and Weibull Lifetime Data under a General Censoring Scheme and in the presence of Covariates
Communications in Statistics - Theory and Methods ISSN: 0361-0926 (Print) 1532-415X (Online) Journal homepage: http://www.tandfonline.com/loi/lsta20 Analysis of Gamma and Weibull Lifetime Data under a
More informationInformation in a Two-Stage Adaptive Optimal Design
Information in a Two-Stage Adaptive Optimal Design Department of Statistics, University of Missouri Designed Experiments: Recent Advances in Methods and Applications DEMA 2011 Isaac Newton Institute for
More informationPrimal-dual Covariate Balance and Minimal Double Robustness via Entropy Balancing
Primal-dual Covariate Balance and Minimal Double Robustness via (Joint work with Daniel Percival) Department of Statistics, Stanford University JSM, August 9, 2015 Outline 1 2 3 1/18 Setting Rubin s causal
More informationMaster s Written Examination
Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth
More informationPosterior Regularization
Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods
More informationSemiparametric posterior limits
Statistics Department, Seoul National University, Korea, 2012 Semiparametric posterior limits for regular and some irregular problems Bas Kleijn, KdV Institute, University of Amsterdam Based on collaborations
More informationA Few Notes on Fisher Information (WIP)
A Few Notes on Fisher Information (WIP) David Meyer dmm@{-4-5.net,uoregon.edu} Last update: April 30, 208 Definitions There are so many interesting things about Fisher Information and its theoretical properties
More informationA Very Brief Summary of Statistical Inference, and Examples
A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random
More informationSystem Identification, Lecture 4
System Identification, Lecture 4 Kristiaan Pelckmans (IT/UU, 2338) Course code: 1RT880, Report code: 61800 - Spring 2012 F, FRI Uppsala University, Information Technology 30 Januari 2012 SI-2012 K. Pelckmans
More informationChapter 5: Models used in conjunction with sampling. J. Kim, W. Fuller (ISU) Chapter 5: Models used in conjunction with sampling 1 / 70
Chapter 5: Models used in conjunction with sampling J. Kim, W. Fuller (ISU) Chapter 5: Models used in conjunction with sampling 1 / 70 Nonresponse Unit Nonresponse: weight adjustment Item Nonresponse:
More informationInformation in Data. Sufficiency, Ancillarity, Minimality, and Completeness
Information in Data Sufficiency, Ancillarity, Minimality, and Completeness Important properties of statistics that determine the usefulness of those statistics in statistical inference. These general properties
More informationRecap. Vector observation: Y f (y; θ), Y Y R m, θ R d. sample of independent vectors y 1,..., y n. pairwise log-likelihood n m. weights are often 1
Recap Vector observation: Y f (y; θ), Y Y R m, θ R d sample of independent vectors y 1,..., y n pairwise log-likelihood n m i=1 r=1 s>r w rs log f 2 (y ir, y is ; θ) weights are often 1 more generally,
More informationLet us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided
Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or
More informationLikelihood Inference in the Presence of Nuisance Parameters
PHYSTAT2003, SLAC, September 8-11, 2003 1 Likelihood Inference in the Presence of Nuance Parameters N. Reid, D.A.S. Fraser Department of Stattics, University of Toronto, Toronto Canada M5S 3G3 We describe
More informationCentral Limit Theorem ( 5.3)
Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately
More informationSTATS 200: Introduction to Statistical Inference. Lecture 29: Course review
STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout
More informationLikelihood Inference in the Presence of Nuisance Parameters
Likelihood Inference in the Presence of Nuance Parameters N Reid, DAS Fraser Department of Stattics, University of Toronto, Toronto Canada M5S 3G3 We describe some recent approaches to likelihood based
More informationBasic math for biology
Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationVarious types of likelihood
Various types of likelihood 1. likelihood, marginal likelihood, conditional likelihood, profile likelihood, adjusted profile likelihood 2. semi-parametric likelihood, partial likelihood 3. empirical likelihood,
More informationLecture 25: Review. Statistics 104. April 23, Colin Rundel
Lecture 25: Review Statistics 104 Colin Rundel April 23, 2012 Joint CDF F (x, y) = P [X x, Y y] = P [(X, Y ) lies south-west of the point (x, y)] Y (x,y) X Statistics 104 (Colin Rundel) Lecture 25 April
More informationVariational Bayesian Inference for Parametric and Non-Parametric Regression with Missing Predictor Data
for Parametric and Non-Parametric Regression with Missing Predictor Data August 23, 2010 Introduction Bayesian inference For parametric regression: long history (e.g. Box and Tiao, 1973; Gelman, Carlin,
More informationMultivariate Statistics
Multivariate Statistics Chapter 2: Multivariate distributions and inference Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2016/2017 Master in Mathematical
More information1 Glivenko-Cantelli type theorems
STA79 Lecture Spring Semester Glivenko-Cantelli type theorems Given i.i.d. observations X,..., X n with unknown distribution function F (t, consider the empirical (sample CDF ˆF n (t = I [Xi t]. n Then
More informationSome General Types of Tests
Some General Types of Tests We may not be able to find a UMP or UMPU test in a given situation. In that case, we may use test of some general class of tests that often have good asymptotic properties.
More informationEstimation and Model Selection in Mixed Effects Models Part I. Adeline Samson 1
Estimation and Model Selection in Mixed Effects Models Part I Adeline Samson 1 1 University Paris Descartes Summer school 2009 - Lipari, Italy These slides are based on Marc Lavielle s slides Outline 1
More informationLecture 3 September 1
STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have
More informationFor iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions.
Large Sample Theory Study approximate behaviour of ˆθ by studying the function U. Notice U is sum of independent random variables. Theorem: If Y 1, Y 2,... are iid with mean µ then Yi n µ Called law of
More informationMachine learning - HT Maximum Likelihood
Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce
More informationCombining Non-probability and Probability Survey Samples Through Mass Imputation
Combining Non-probability and Probability Survey Samples Through Mass Imputation Jae-Kwang Kim 1 Iowa State University & KAIST October 27, 2018 1 Joint work with Seho Park, Yilin Chen, and Changbao Wu
More informationComparison between conditional and marginal maximum likelihood for a class of item response models
(1/24) Comparison between conditional and marginal maximum likelihood for a class of item response models Francesco Bartolucci, University of Perugia (IT) Silvia Bacci, University of Perugia (IT) Claudia
More informationInference in non-linear time series
Intro LS MLE Other Erik Lindström Centre for Mathematical Sciences Lund University LU/LTH & DTU Intro LS MLE Other General Properties Popular estimatiors Overview Introduction General Properties Estimators
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationMCMC algorithms for fitting Bayesian models
MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models
More informationSTAT 730 Chapter 4: Estimation
STAT 730 Chapter 4: Estimation Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 23 The likelihood We have iid data, at least initially. Each datum
More informationPrinciples of Statistics
Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 81 Paper 4, Section II 28K Let g : R R be an unknown function, twice continuously differentiable with g (x) M for
More informationMaximum Likelihood Estimation
University of Pavia Maximum Likelihood Estimation Eduardo Rossi Likelihood function Choosing parameter values that make what one has observed more likely to occur than any other parameter values do. Assumption(Distribution)
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More information1 One-way analysis of variance
LIST OF FORMULAS (Version from 21. November 2014) STK2120 1 One-way analysis of variance Assume X ij = µ+α i +ɛ ij ; j = 1, 2,..., J i ; i = 1, 2,..., I ; where ɛ ij -s are independent and N(0, σ 2 ) distributed.
More informationMaximum Likelihood Tests and Quasi-Maximum-Likelihood
Maximum Likelihood Tests and Quasi-Maximum-Likelihood Wendelin Schnedler Department of Economics University of Heidelberg 10. Dezember 2007 Wendelin Schnedler (AWI) Maximum Likelihood Tests and Quasi-Maximum-Likelihood10.
More informationBIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation
BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation Yujin Chung November 29th, 2016 Fall 2016 Yujin Chung Lec13: MLE Fall 2016 1/24 Previous Parametric tests Mean comparisons (normality assumption)
More informationGauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA
JAPANESE BEETLE DATA 6 MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA Gauge Plots TuscaroraLisa Central Madsen Fairways, 996 January 9, 7 Grubs Adult Activity Grub Counts 6 8 Organic Matter
More information