Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach


Jae-Kwang Kim, Department of Statistics, Iowa State University

Outline
1. Introduction
2. Observed likelihood
3. Mean Score Approach
4. Observed Information

1. Introduction - Basic Setup (no missing data)

$y = (y_1, y_2, \dots, y_n)$ is a realization of a random sample from an infinite population with density $f(y)$. Assume that the true density $f(y)$ belongs to a parametric family of densities $\mathcal{P} = \{f(y;\theta);\ \theta\in\Omega\}$ indexed by $\theta\in\Omega$. That is, there exists $\theta_0\in\Omega$ such that $f(y;\theta_0) = f(y)$ for all $y$.

Likelihood: definitions for likelihood theory

The likelihood function of $\theta$ is defined as $L(\theta) = f(y;\theta)$, where $f(y;\theta)$ is the joint pdf of $y$.

The estimator $\hat\theta$ is called the maximum likelihood estimator (MLE) of $\theta_0$ if it satisfies $L(\hat\theta) = \max_{\theta\in\Omega} L(\theta)$.

A parametric family of densities $\mathcal{P} = \{f(y;\theta);\ \theta\in\Omega\}$ is identifiable if $f(y;\theta_1) = f(y;\theta_2)$ for all $y$ implies $\theta_1 = \theta_2$.

Lemma 2.1: property of identifiable distributions

Lemma 2.1. If $\mathcal{P} = \{f(y;\theta);\ \theta\in\Omega\}$ is identifiable and $E|\ln f(y;\theta)| < \infty$ for all $\theta$, then
$$M(\theta) = E_{\theta_0}\left[\ln\frac{f(y;\theta)}{f(y;\theta_0)}\right]$$
has a unique maximum at $\theta_0$.

[Proof] Let $Z = f(y;\theta)/f(y;\theta_0)$. By the strict version of Jensen's inequality, $E[\ln(Z)] < \ln[E(Z)]$ unless $Z$ is degenerate. Because $E_{\theta_0}(Z) = \int f(y;\theta)\,dy = 1$, we have $\ln[E(Z)] = 0$, and hence $M(\theta) = E_{\theta_0}[\ln(Z)] < 0 = M(\theta_0)$ for every $\theta\neq\theta_0$.

Remark

1. In Lemma 2.1, $-M(\theta)$ is the Kullback-Leibler divergence of $f(y;\theta)$ from $f(y;\theta_0)$. It is often considered a measure of the distance between the two distributions.
2. Lemma 2.1 simply says that, under identifiability, $Q(\theta) = E_{\theta_0}\{\log f(Y;\theta)\}$ takes its (unique) maximum value at $\theta = \theta_0$.
3. Define $Q_n(\theta) = n^{-1}\sum_{i=1}^n \log f(y_i;\theta)$; then the MLE $\hat\theta$ is the maximizer of $Q_n(\theta)$. Since $Q_n(\theta)$ converges in probability to $Q(\theta)$, can we say that the maximizer of $Q_n(\theta)$ converges to the maximizer of $Q(\theta)$?

Theorem 2.1: (weak) consistency of the MLE

Theorem 2.1. Let
$$Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(y_i;\theta)$$
and let $\hat\theta$ be any solution of $Q_n(\hat\theta) = \max_{\theta\in\Omega} Q_n(\theta)$. Assume the following two conditions:
1. Uniform weak convergence: for some non-stochastic function $Q(\theta)$,
$$\sup_{\theta\in\Omega} |Q_n(\theta) - Q(\theta)| \stackrel{p}{\longrightarrow} 0.$$
2. Identification: $Q(\theta)$ is uniquely maximized at $\theta_0$.
Then $\hat\theta \stackrel{p}{\longrightarrow} \theta_0$.

Remark

1. Convergence in probability, $\hat\theta \stackrel{p}{\longrightarrow} \theta_0$, means that
$$P\{\|\hat\theta - \theta_0\| > \epsilon\} \to 0 \quad\text{as } n\to\infty,$$
for any $\epsilon > 0$.
2. If $\mathcal{P} = \{f(y;\theta);\ \theta\in\Omega\}$ is not identifiable, then $Q(\theta)$ may not have a unique maximum and $\hat\theta$ may not converge (in probability) to a single point.
3. If the true distribution $f(y)$ does not belong to the class $\mathcal{P} = \{f(y;\theta);\ \theta\in\Omega\}$, to which point does $\hat\theta$ converge?
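The consistency statement of Theorem 2.1 is easy to check empirically. The following sketch (assuming numpy is available; the exponential model $f(y;\theta) = \theta e^{-\theta y}$, whose MLE is $1/\bar y$, and the sample sizes are illustrative choices, not from the slides) shows the MLE approaching $\theta_0$ as $n$ grows.

import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0
for n in [10, 100, 1000, 10000, 100000]:
    y = rng.exponential(scale=1 / theta0, size=n)
    theta_hat = 1 / y.mean()      # MLE of theta for f(y; theta) = theta * exp(-theta * y)
    print(n, theta_hat)           # approaches theta0 = 2.0 as n grows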

Example: log-likelihood function for $N(\theta, 1)$

Under the assumption $y_i \sim N(\theta, 1)$, the log-likelihood is
$$l_n(\theta) = \sum_{i=1}^n \log f(y_i;\theta).$$
[Figure on the slide: plot of $l_n(\theta)$ against $\theta$.]
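The plot referred to above can be reproduced along the following lines (a sketch assuming numpy and matplotlib; the seed, sample size, and true value $\theta_0 = 2$ are illustrative):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=100)       # sample from N(theta0 = 2, 1)

theta_grid = np.linspace(0.0, 4.0, 401)
# l_n(theta) = sum_i log phi(y_i - theta) under the N(theta, 1) model
loglik = np.array([-0.5 * np.sum((y - t) ** 2) - 0.5 * len(y) * np.log(2 * np.pi)
                   for t in theta_grid])

plt.plot(theta_grid, loglik)
plt.axvline(y.mean(), linestyle="--")              # the MLE of theta is the sample mean
plt.xlabel("theta"); plt.ylabel("l_n(theta)")
plt.show()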

Other properties of the MLE

- Asymptotic normality (Theorem 2.2)
- Asymptotic optimality: the MLE achieves the Cramer-Rao lower bound
- Wilks' theorem: $2\{l_n(\hat\theta) - l_n(\theta_0)\} \stackrel{d}{\longrightarrow} \chi^2_p$

1. Introduction - Fisher information

Definition
1. Score function: $S(\theta) = \frac{\partial}{\partial\theta}\log L(\theta)$
2. Fisher information (curvature of the log-likelihood): $I(\theta) = -\frac{\partial^2}{\partial\theta\,\partial\theta^T}\log L(\theta) = -\frac{\partial}{\partial\theta^T} S(\theta)$
3. Observed (Fisher) information: $I(\hat\theta_n)$, where $\hat\theta_n$ is the MLE.
4. Expected (Fisher) information: $\mathcal{I}(\theta) = E_\theta\{I(\theta)\}$

- The observed information is always positive.
- The observed information applies to a single dataset.
- The expected information is meaningful as a function of $\theta$ across the admissible values of $\theta$; it is an average quantity over all possible datasets.
- $I(\hat\theta) = \mathcal{I}(\hat\theta)$ for the exponential family.
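These definitions can be checked numerically. The sketch below (assuming numpy; finite differences stand in for the analytic derivatives, and all names are illustrative) evaluates the score and the observed information of the $N(\theta,1)$ example at the MLE.

import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=100)       # sample from N(theta0 = 2, 1)

def loglik(theta):
    return -0.5 * np.sum((y - theta) ** 2) - 0.5 * len(y) * np.log(2 * np.pi)

def score(theta, h=1e-5):
    # S(theta) = d log L(theta) / d theta, by central differences
    return (loglik(theta + h) - loglik(theta - h)) / (2 * h)

def obs_info(theta, h=1e-4):
    # I(theta) = -d^2 log L(theta) / d theta^2
    return -(loglik(theta + h) - 2 * loglik(theta) + loglik(theta - h)) / h ** 2

theta_hat = y.mean()                               # MLE under the N(theta, 1) model
print(score(theta_hat))                            # approximately 0 at the MLE
print(obs_info(theta_hat), len(y))                 # both are n: here observed = expected information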

Properties of the score function

Theorem 2.3. Under the regularity conditions allowing the exchange of the order of integration and differentiation,
$$E_\theta\{S(\theta)\} = 0 \quad\text{and}\quad V_\theta\{S(\theta)\} = \mathcal{I}(\theta).$$

[Proof] Differentiating the identity $\int f(y;\theta)\,dy = 1$ with respect to $\theta$ and exchanging integration and differentiation gives $\int S(\theta)\, f(y;\theta)\,dy = 0$, that is, $E_\theta\{S(\theta)\} = 0$. Differentiating once more gives $E_\theta\{\partial S(\theta)/\partial\theta^T\} + E_\theta\{S(\theta)S(\theta)^T\} = 0$, that is, $V_\theta\{S(\theta)\} = E_\theta\{I(\theta)\} = \mathcal{I}(\theta)$.

1. Introduction - Fisher information

Remark. Since $\hat\theta \stackrel{p}{\longrightarrow} \theta_0$, we can apply a Taylor linearization to $S(\hat\theta) = 0$ to get
$$\hat\theta - \theta_0 \doteq \{\mathcal{I}(\theta_0)\}^{-1} S(\theta_0).$$
Here we use the fact that $I(\theta) = -\partial S(\theta)/\partial\theta^T$ converges in probability to $\mathcal{I}(\theta)$. Thus, the (asymptotic) variance of the MLE is
$$V(\hat\theta) \doteq \{\mathcal{I}(\theta_0)\}^{-1} V\{S(\theta_0)\}\{\mathcal{I}(\theta_0)\}^{-1} = \{\mathcal{I}(\theta_0)\}^{-1},$$
where the last equality follows from Theorem 2.3.

2. Observed likelihood

Basic Setup

1. Let $y = (y_1, y_2, \dots, y_n)$ be a realization of a random sample from an infinite population with density $f(y;\theta_0)$.
2. Let $\delta_i$ be an indicator function defined by
$$\delta_i = \begin{cases} 1 & \text{if } y_i \text{ is observed}\\ 0 & \text{otherwise,}\end{cases}$$
and assume that $\Pr(\delta_i = 1\mid y_i) = \pi(y_i,\phi)$ for some (unknown) parameter $\phi$, where $\pi(\cdot)$ is a known function.
3. Thus, instead of observing $(\delta_i, y_i)$ directly, we observe $(\delta_i, y_{i,\mathrm{obs}})$, where
$$y_{i,\mathrm{obs}} = \begin{cases} y_i & \text{if } \delta_i = 1\\ * & \text{if } \delta_i = 0.\end{cases}$$
4. What is the (marginal) density of $(y_{i,\mathrm{obs}},\delta_i)$?

Motivation (change-of-variable technique)

1. Suppose that $z$ is a random variable with density $f(z)$.
2. Instead of observing $z$ directly, we observe only $y = y(z)$, where the mapping $z\mapsto y(z)$ is known.
3. The density of $y$ is
$$g(y) = \int_{R(y)} f(z)\,dz,$$
where $R(y) = \{z;\ y(z) = y\}$.

Derivation

1. The original distribution of $z = (y,\delta)$ is $f(y,\delta) = f_1(y)\, f_2(\delta\mid y)$, where $f_1(y)$ is the density of $y$ and $f_2(\delta\mid y)$ is the conditional density of $\delta$ given $y$, given by
$$f_2(\delta\mid y) = \{\pi(y)\}^\delta\{1-\pi(y)\}^{1-\delta},$$
where $\pi(y) = \Pr(\delta = 1\mid y)$.
2. Instead of observing $z = (y,\delta)$ directly, we observe only $(y_{\mathrm{obs}},\delta)$, where $y_{\mathrm{obs}} = y_{\mathrm{obs}}(y,\delta)$ and the mapping $(y,\delta)\mapsto y_{\mathrm{obs}}$ is known.
3. The (marginal) density of $(y_{\mathrm{obs}},\delta)$ is
$$g(y_{\mathrm{obs}},\delta) = \int_{R(y_{\mathrm{obs}},\delta)} f(y,\delta)\,dy,$$
where $R(y_{\mathrm{obs}},\delta) = \{y;\ y_{\mathrm{obs}}(y,\delta) = y_{\mathrm{obs}}\}$.

Likelihood under missing data (observed likelihood)

The observed likelihood is the likelihood obtained from the marginal density of $(y_{i,\mathrm{obs}},\delta_i)$, $i = 1, 2, \dots, n$. Under the IID setup it can be written as
$$L_{\mathrm{obs}}(\theta,\phi) = \prod_{\delta_i=1}\left[f_1(y_i;\theta)\, f_2(\delta_i\mid y_i;\phi)\right]\prod_{\delta_i=0}\left[\int f_1(y_i;\theta)\, f_2(\delta_i\mid y_i;\phi)\,dy_i\right]
= \prod_{\delta_i=1}\left[f_1(y_i;\theta)\,\pi(y_i;\phi)\right]\prod_{\delta_i=0}\left[\int f_1(y;\theta)\{1-\pi(y;\phi)\}\,dy\right],$$
where $\pi(y;\phi) = P(\delta = 1\mid y;\phi)$.

Example 2.2 (censored regression model, or Tobit model)

$$z_i = x_i'\beta + \epsilon_i,\qquad \epsilon_i \sim N(0,\sigma^2),\qquad y_i = \begin{cases} z_i & \text{if } z_i \ge 0\\ 0 & \text{if } z_i < 0.\end{cases}$$
The observed log-likelihood is
$$l(\beta,\sigma^2) = -\frac{1}{2}\sum_{y_i>0}\left[\ln 2\pi + \ln\sigma^2 + \frac{(y_i - x_i'\beta)^2}{\sigma^2}\right] + \sum_{y_i=0}\ln\left[1 - \Phi\!\left(\frac{x_i'\beta}{\sigma}\right)\right],$$
where $\Phi(x)$ is the cdf of the standard normal distribution.
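One way to translate this observed log-likelihood into Python (a sketch assuming numpy and scipy; the function name and the arrays y and X are illustrative):

import numpy as np
from scipy.stats import norm

def tobit_loglik(beta, sigma2, y, X):
    """Observed log-likelihood of the censored (Tobit) regression model."""
    xb = X @ beta
    uncens = y > 0                                   # uncensored cases
    ll_unc = -0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2)
                           + (y[uncens] - xb[uncens]) ** 2 / sigma2)
    ll_cens = np.sum(np.log(1.0 - norm.cdf(xb[~uncens] / np.sqrt(sigma2))))
    return ll_unc + ll_cens

Maximizing this function over $(\beta,\sigma^2)$, for example by passing its negative to a numerical optimizer such as scipy.optimize.minimize (with $\sigma^2$ parameterized on the log scale to keep it positive), gives the MLE.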

Example 2.3

Let $t_1, t_2, \dots, t_n$ be an IID sample from a distribution with density $f_\theta(t) = \theta e^{-\theta t} I(t>0)$. Instead of observing $t_i$, we observe $(y_i,\delta_i)$, where
$$y_i = \begin{cases} t_i & \text{if } \delta_i = 1\\ c & \text{if } \delta_i = 0\end{cases}
\qquad\text{and}\qquad
\delta_i = \begin{cases} 1 & \text{if } t_i \le c\\ 0 & \text{if } t_i > c,\end{cases}$$
where $c$ is a known censoring time. The observed likelihood for $\theta$ can be derived as
$$L_{\mathrm{obs}}(\theta) = \prod_{i=1}^n\{f_\theta(t_i)\}^{\delta_i}\{P(t_i > c)\}^{1-\delta_i} = \theta^{\sum_{i=1}^n\delta_i}\exp\left(-\theta\sum_{i=1}^n y_i\right).$$
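A quick simulation of this censored-exponential setup (a sketch assuming numpy; $\theta_0 = 0.5$, $c = 2$, and $n = 1000$ are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(2)
theta0, c, n = 0.5, 2.0, 1000
t = rng.exponential(scale=1 / theta0, size=n)
delta = (t <= c).astype(float)      # response (non-censoring) indicator
y = np.minimum(t, c)                # observed value: t_i if uncensored, c otherwise

theta_hat = delta.sum() / y.sum()   # maximizer of L_obs(theta) = theta^{sum delta} exp(-theta sum y)
print(theta_hat)                    # close to theta0 for large n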

Multivariate extension: basic setup

- Let $y = (y_1,\dots,y_p)$ be a $p$-dimensional random vector with probability density function $f(y;\theta)$.
- Let $\delta_{ij}$ be the response indicator function of $y_{ij}$, with
$$\delta_{ij} = \begin{cases} 1 & \text{if } y_{ij} \text{ is observed}\\ 0 & \text{otherwise.}\end{cases}$$
- $\delta_i = (\delta_{i1},\dots,\delta_{ip})$: a $p$-dimensional random vector with density $P(\delta\mid y)$, assuming $P(\delta\mid y) = P(\delta\mid y;\phi)$ for some $\phi$.
- Let $(y_{i,\mathrm{obs}}, y_{i,\mathrm{mis}})$ be the observed part and the missing part of $y_i$, respectively.
- Let $R(y_{\mathrm{obs}},\delta) = \{y;\ y_{\mathrm{obs}}(y_i,\delta_i) = y_{i,\mathrm{obs}},\ i = 1,\dots,n\}$ be the set of all possible values of $y$ with the same realized value of $y_{\mathrm{obs}}$ for given $\delta$, where $y_{\mathrm{obs}}(y_i,\delta_i)$ is the function that returns the values $y_{ij}$ with $\delta_{ij} = 1$.

Definition: observed likelihood

Under the above setup, the observed likelihood of $(\theta,\phi)$ is
$$L_{\mathrm{obs}}(\theta,\phi) = \int_{R(y_{\mathrm{obs}},\delta)} f(y;\theta)\, P(\delta\mid y;\phi)\,dy.$$
Under the IID setup, the observed likelihood is
$$L_{\mathrm{obs}}(\theta,\phi) = \prod_{i=1}^n\left[\int f(y_i;\theta)\, P(\delta_i\mid y_i;\phi)\,dy_{i,\mathrm{mis}}\right],$$
where it is understood that, if $y_i = y_{i,\mathrm{obs}}$ and $y_{i,\mathrm{mis}}$ is empty, there is nothing to integrate out.
In the special case of scalar $y$, the observed likelihood is
$$L_{\mathrm{obs}}(\theta,\phi) = \prod_{\delta_i=1}\left[f(y_i;\theta)\,\pi(y_i;\phi)\right]\prod_{\delta_i=0}\left[\int f(y;\theta)\{1-\pi(y;\phi)\}\,dy\right],$$
where $\pi(y;\phi) = P(\delta = 1\mid y;\phi)$.
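For scalar $y$, the integral in the nonrespondent factor can be evaluated by numerical quadrature. The sketch below (assuming numpy and scipy; the choice $y\sim N(\theta,1)$ with a logistic response probability $\pi(y;\phi) = \mathrm{expit}(\phi_0 + \phi_1 y)$ and all names are illustrative, not from the slides) computes the observed log-likelihood for given parameter values.

import numpy as np
from scipy.stats import norm
from scipy.special import expit
from scipy.integrate import quad

def obs_loglik(theta, phi, y, delta):
    """Observed log-likelihood for scalar y ~ N(theta, 1) and pi(y; phi) = expit(phi[0] + phi[1]*y)."""
    ll = 0.0
    for yi, di in zip(y, delta):
        if di == 1:
            ll += norm.logpdf(yi, theta, 1.0) + np.log(expit(phi[0] + phi[1] * yi))
        else:
            # integrate f(y; theta) * {1 - pi(y; phi)} over the missing value
            val, _ = quad(lambda u: norm.pdf(u, theta, 1.0) * (1 - expit(phi[0] + phi[1] * u)),
                          -np.inf, np.inf)
            ll += np.log(val)
    return ll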

Missing At Random

Definition: missing at random (MAR)
- $P(\delta\mid y)$ is the density of the conditional distribution of $\delta$ given $y$. Let $y_{\mathrm{obs}} = y_{\mathrm{obs}}(y,\delta)$, where
$$y_{i,\mathrm{obs}} = \begin{cases} y_i & \text{if } \delta_i = 1\\ * & \text{if } \delta_i = 0.\end{cases}$$
- The response mechanism is MAR if
$$P(\delta\mid y_1) = P(\delta\mid y_2)\quad\{\text{or } P(\delta\mid y) = P(\delta\mid y_{\mathrm{obs}})\}$$
for all $y_1$ and $y_2$ satisfying $y_{\mathrm{obs}}(y_1,\delta) = y_{\mathrm{obs}}(y_2,\delta)$.
- MAR: the response mechanism $P(\delta\mid y)$ depends on $y$ only through $y_{\mathrm{obs}}$.
- Let $y = (y_{\mathrm{obs}}, y_{\mathrm{mis}})$. By Bayes' theorem,
$$P(y_{\mathrm{mis}}\mid y_{\mathrm{obs}},\delta) = \frac{P(\delta\mid y_{\mathrm{mis}}, y_{\mathrm{obs}})}{P(\delta\mid y_{\mathrm{obs}})}\, P(y_{\mathrm{mis}}\mid y_{\mathrm{obs}}).$$
- MAR: $P(y_{\mathrm{mis}}\mid y_{\mathrm{obs}},\delta) = P(y_{\mathrm{mis}}\mid y_{\mathrm{obs}})$; that is, $y_{\mathrm{mis}} \perp \delta \mid y_{\mathrm{obs}}$.
- MAR: the conditional independence of $\delta$ and $y_{\mathrm{mis}}$ given $y_{\mathrm{obs}}$.

Remark

- MCAR (missing completely at random): $P(\delta\mid y)$ does not depend on $y$.
- MAR (missing at random): $P(\delta\mid y) = P(\delta\mid y_{\mathrm{obs}})$.
- NMAR (not missing at random): $P(\delta\mid y) \neq P(\delta\mid y_{\mathrm{obs}})$.

Thus, MCAR is a special case of MAR.

Likelihood factorization theorem

Theorem 2.4 (Rubin, 1976). Let $P_\phi(\delta\mid y)$ be the joint density of $\delta$ given $y$ and let $f_\theta(y)$ be the joint density of $y$. Under the conditions that
1. the parameters $\theta$ and $\phi$ are distinct, and
2. the MAR condition holds,
the observed likelihood can be written as
$$L_{\mathrm{obs}}(\theta,\phi) = L_1(\theta)\, L_2(\phi),$$
and the MLE of $\theta$ can be obtained by maximizing $L_1(\theta)$. Thus, we do not have to specify a model for the response mechanism.

The response mechanism is called ignorable if the above likelihood factorization holds.

Example 2.4

- Bivariate data $(x_i, y_i)$ with pdf $f(x,y) = f_1(y\mid x)\, f_2(x)$.
- $x_i$ is always observed and $y_i$ is subject to missingness.
- Assume that the response status variable $\delta_i$ of $y_i$ satisfies
$$P(\delta_i = 1\mid x_i, y_i) = \Lambda_1(\phi_0 + \phi_1 x_i + \phi_2 y_i)$$
for some function $\Lambda_1(\cdot)$ of known form.
- Let $\theta$ be the parameter of interest in the regression model $f_1(y\mid x;\theta)$. Let $\alpha$ be the parameter in the marginal distribution of $x$, denoted by $f_2(x_i;\alpha)$. Define $\Lambda_0(x) = 1 - \Lambda_1(x)$.
- Three parameters: $\theta$ is the parameter of interest; $\alpha$ and $\phi$ are nuisance parameters.

Example 2.4 (cont'd)

Observed likelihood:
$$L_{\mathrm{obs}}(\theta,\alpha,\phi) = \prod_{\delta_i=1} f_1(y_i\mid x_i;\theta)\, f_2(x_i;\alpha)\,\Lambda_1(\phi_0+\phi_1 x_i+\phi_2 y_i)\;
\prod_{\delta_i=0}\int f_1(y\mid x_i;\theta)\, f_2(x_i;\alpha)\,\Lambda_0(\phi_0+\phi_1 x_i+\phi_2 y)\,dy
= L_1(\theta,\phi)\, L_2(\alpha),$$
where $L_2(\alpha) = \prod_{i=1}^n f_2(x_i;\alpha)$. Thus, we can safely ignore the marginal distribution of $x$ if $x$ is completely observed.

Example 2.4 (cont'd)

If $\phi_2 = 0$, then MAR holds and $L_1(\theta,\phi) = L_{1a}(\theta)\, L_{1b}(\phi)$, where
$$L_{1a}(\theta) = \prod_{\delta_i=1} f_1(y_i\mid x_i;\theta)$$
and
$$L_{1b}(\phi) = \prod_{\delta_i=1}\Lambda_1(\phi_0+\phi_1 x_i)\prod_{\delta_i=0}\Lambda_0(\phi_0+\phi_1 x_i).$$
Thus, under MAR, the MLE of $\theta$ can be obtained by maximizing $L_{1a}(\theta)$, which is obtained by ignoring the missing part of the data.

Example 2.4 (cont'd)

Instead of $y_i$ being subject to missingness, suppose that $x_i$ is subject to missingness. Then the observed likelihood becomes
$$L_{\mathrm{obs}}(\theta,\phi,\alpha) = \prod_{\delta_i=1} f_1(y_i\mid x_i;\theta)\, f_2(x_i;\alpha)\,\Lambda_1(\phi_0+\phi_1 x_i+\phi_2 y_i)\;
\prod_{\delta_i=0}\int f_1(y_i\mid x;\theta)\, f_2(x;\alpha)\,\Lambda_0(\phi_0+\phi_1 x+\phi_2 y_i)\,dx.$$
If $\phi_1 = 0$, then
$$L_{\mathrm{obs}}(\theta,\alpha,\phi) = L_1(\theta,\alpha)\, L_2(\phi)$$
and MAR holds. Although we are not interested in the marginal distribution of $x$, we still need to specify a model for the marginal distribution of $x$.

Chapter 2, Part 2

3. Mean Score Approach

3. Mean Score Approach

- The observed likelihood is the marginal density of $(y_{\mathrm{obs}},\delta)$:
$$L_{\mathrm{obs}}(\eta) = \int_{R(y_{\mathrm{obs}},\delta)} f(y;\theta)\, P(\delta\mid y;\phi)\,d\mu(y) = \int f(y;\theta)\, P(\delta\mid y;\phi)\,d\mu(y_{\mathrm{mis}}),$$
where $y_{\mathrm{mis}}$ is the missing part of $y$ and $\eta = (\theta,\phi)$.
- Observed score equation:
$$S_{\mathrm{obs}}(\eta) \equiv \frac{\partial}{\partial\eta}\log L_{\mathrm{obs}}(\eta) = 0.$$
- Computing the observed score function can be computationally challenging because the observed likelihood is in integral form.

3. Mean Score Approach

Theorem 2.5: Mean score theorem (Fisher, 1922). Under some regularity conditions, the observed score function equals the mean score function; that is, $S_{\mathrm{obs}}(\eta) = \bar S(\eta)$, where
$$\bar S(\eta) = E\{S_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\},\qquad S_{\mathrm{com}}(\eta) = \frac{\partial}{\partial\eta}\log f(y,\delta;\eta),\qquad f(y,\delta;\eta) = f(y;\theta)\, P(\delta\mid y;\phi).$$
- The mean score function is computed by taking the conditional expectation of the complete-sample score function given the observed data.
- The mean score function is easier to compute than the observed score function.

3. Mean Score Approach

Proof of Theorem 2.5. Since $L_{\mathrm{obs}}(\eta) = f(y,\delta;\eta)/f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta)$, we have
$$\frac{\partial}{\partial\eta}\ln L_{\mathrm{obs}}(\eta) = \frac{\partial}{\partial\eta}\ln f(y,\delta;\eta) - \frac{\partial}{\partial\eta}\ln f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta).$$
Taking the conditional expectation of the above equation over the conditional distribution of $(y,\delta)$ given $(y_{\mathrm{obs}},\delta)$, we have
$$\frac{\partial}{\partial\eta}\ln L_{\mathrm{obs}}(\eta) = E\left\{\frac{\partial}{\partial\eta}\ln L_{\mathrm{obs}}(\eta)\ \Big|\ y_{\mathrm{obs}},\delta\right\}
= E\{S_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\} - E\left\{\frac{\partial}{\partial\eta}\ln f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta)\ \Big|\ y_{\mathrm{obs}},\delta\right\}.$$
Here, the first equality holds because $L_{\mathrm{obs}}(\eta)$ is a function of $(y_{\mathrm{obs}},\delta)$ only. The last term is equal to zero by Theorem 2.3, which states that the expected value of the score function is zero; the reference distribution in this case is the conditional distribution of $(y,\delta)$ given $(y_{\mathrm{obs}},\delta)$.

Example 2.6

1. Suppose that the study variable $y$ follows a normal distribution with mean $x'\beta$ and variance $\sigma^2$. The score equations for $\beta$ and $\sigma^2$ under complete response are
$$S_1(\beta,\sigma^2) = \sum_{i=1}^n (y_i - x_i'\beta)\, x_i/\sigma^2 = 0$$
and
$$S_2(\beta,\sigma^2) = -\frac{n}{2\sigma^2} + \sum_{i=1}^n\frac{(y_i - x_i'\beta)^2}{2\sigma^4} = 0.$$
2. Assume that the $y_i$ are observed only for the first $r$ elements and that the MAR assumption holds. In this case, the mean score functions reduce to
$$\bar S_1(\beta,\sigma^2) = \sum_{i=1}^r (y_i - x_i'\beta)\, x_i/\sigma^2$$
and
$$\bar S_2(\beta,\sigma^2) = -\frac{n}{2\sigma^2} + \sum_{i=1}^r\frac{(y_i - x_i'\beta)^2}{2\sigma^4} + \frac{n-r}{2\sigma^2}.$$

Example 2.6 (cont'd)

3. The maximum likelihood estimator obtained by solving the mean score equations is
$$\hat\beta = \left(\sum_{i=1}^r x_i x_i'\right)^{-1}\sum_{i=1}^r x_i y_i
\qquad\text{and}\qquad
\hat\sigma^2 = \frac{1}{r}\sum_{i=1}^r\left(y_i - x_i'\hat\beta\right)^2.$$
Thus, the resulting estimators can also be obtained by simply ignoring the missing part of the sample, which is consistent with the result in Example 2.4 (for $\phi_2 = 0$).
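In code, solving the mean score equations above is just a complete-case fit. A minimal numpy sketch (the data-generating values and all names are illustrative; the first $r$ units are taken as respondents, as in the example):

import numpy as np

rng = np.random.default_rng(3)
n, r = 500, 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0, sigma0 = np.array([1.0, 2.0]), 1.5
y = X @ beta0 + rng.normal(scale=sigma0, size=n)
y[r:] = np.nan                                          # only the first r responses are observed

Xr, yr = X[:r], y[:r]
beta_hat = np.linalg.solve(Xr.T @ Xr, Xr.T @ yr)        # solves the mean score equation for beta
sigma2_hat = np.mean((yr - Xr @ beta_hat) ** 2)         # divisor r, as in the MLE above
print(beta_hat, sigma2_hat)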

Discussion of Example 2.6

- We are interested in estimating $\theta$ in the conditional density $f(y\mid x;\theta)$.
- Under MAR, the observed likelihood for $\theta$ is
$$L_{\mathrm{obs}}(\theta) = \prod_{i=1}^r f(y_i\mid x_i;\theta)\prod_{i=r+1}^n\int f(y\mid x_i;\theta)\,dy = \prod_{i=1}^r f(y_i\mid x_i;\theta).$$
- The same conclusion follows from the mean score theorem. Under MAR, the mean score function is
$$\bar S(\theta) = \sum_{i=1}^r S(\theta; x_i, y_i) + \sum_{i=r+1}^n E\{S(\theta; x_i, Y)\mid x_i\} = \sum_{i=1}^r S(\theta; x_i, y_i),$$
where $S(\theta; x, y)$ is the score function for $\theta$ and the second equality follows from Theorem 2.3.

Example 2.5

1. Suppose that the study variable $y$ follows a Bernoulli distribution with probability of success $p_i$, where
$$p_i = p_i(\beta) = \frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}$$
for some unknown parameter $\beta$, and $x_i$ is the vector of covariates in the logistic regression model for $y_i$. We assume that 1 is in the column space of the covariates (i.e., the model contains an intercept).
2. Under complete response, the score function for $\beta$ is
$$S_1(\beta) = \sum_{i=1}^n\{y_i - p_i(\beta)\}\, x_i.$$

Example 2.5 (cont'd)

3. Let $\delta_i$ be the response indicator function for $y_i$, with distribution Bernoulli$(\pi_i)$, where
$$\pi_i = \frac{\exp(x_i'\phi_0 + y_i\phi_1)}{1+\exp(x_i'\phi_0 + y_i\phi_1)}.$$
We assume that $x_i$ is always observed, but $y_i$ is missing if $\delta_i = 0$.
4. Under missing data, the mean score function for $\beta$ is
$$\bar S_1(\beta,\phi) = \sum_{\delta_i=1}\{y_i - p_i(\beta)\}\, x_i + \sum_{\delta_i=0}\sum_{y=0}^{1} w_i(y;\beta,\phi)\{y - p_i(\beta)\}\, x_i,$$
where $w_i(y;\beta,\phi)$ is the conditional probability of $y_i = y$ given $x_i$ and $\delta_i = 0$:
$$w_i(y;\beta,\phi) = \frac{P_\beta(y_i = y\mid x_i)\, P_\phi(\delta_i = 0\mid y_i = y, x_i)}{\sum_{z=0}^{1} P_\beta(y_i = z\mid x_i)\, P_\phi(\delta_i = 0\mid y_i = z, x_i)}.$$
Thus, $\bar S_1(\beta,\phi)$ is also a function of $\phi$.

Example 2.5 (cont'd)

5. If the response mechanism is MAR, so that $\phi_1 = 0$, then
$$w_i(y;\beta,\phi) = \frac{P_\beta(y_i = y\mid x_i)}{\sum_{z=0}^{1} P_\beta(y_i = z\mid x_i)} = P_\beta(y_i = y\mid x_i)$$
and so
$$\bar S_1(\beta,\phi) = \sum_{\delta_i=1}\{y_i - p_i(\beta)\}\, x_i,$$
which no longer depends on $\phi$.
6. If MAR does not hold, then $(\hat\beta,\hat\phi)$ can be obtained by solving $\bar S_1(\beta,\phi) = 0$ and $\bar S_2(\beta,\phi) = 0$ jointly, where
$$\bar S_2(\beta,\phi) = \sum_{\delta_i=1}\{\delta_i - \pi(\phi; x_i, y_i)\}(x_i, y_i) + \sum_{\delta_i=0}\sum_{y=0}^{1} w_i(y;\beta,\phi)\{\delta_i - \pi_i(\phi; x_i, y)\}(x_i, y).$$
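A sketch of the weights $w_i(y;\beta,\phi)$ and of $\bar S_1(\beta,\phi)$ in code (assuming numpy and scipy; the function and argument names are illustrative, with phi0 the coefficient vector of $x$ and phi1 the scalar coefficient of $y$ in the response model):

import numpy as np
from scipy.special import expit

def mean_score_beta(beta, phi0, phi1, X, y, delta):
    """Mean score for beta in Example 2.5; entries of y with delta == 0 are ignored."""
    p = expit(X @ beta)                                   # p_i(beta)
    resid = np.where(delta == 1, y - p, 0.0)
    S = (resid[:, None] * X).sum(axis=0)                  # respondent contributions
    for i in np.where(delta == 0)[0]:
        pi0 = expit(X[i] @ phi0)                          # P(delta_i = 1 | y_i = 0, x_i)
        pi1 = expit(X[i] @ phi0 + phi1)                   # P(delta_i = 1 | y_i = 1, x_i)
        num = np.array([(1 - p[i]) * (1 - pi0), p[i] * (1 - pi1)])
        w = num / num.sum()                               # w_i(0), w_i(1)
        S += (w[0] * (0 - p[i]) + w[1] * (1 - p[i])) * X[i]
    return S

Setting phi1 = 0 (MAR) makes the nonrespondent terms cancel, and the function returns the respondent-only score, as in item 5.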

Discussion of Example 2.5

- We may not have a unique solution to $\bar S(\eta) = 0$, where $\bar S(\eta) = [\bar S_1(\beta,\phi)', \bar S_2(\beta,\phi)']'$, when MAR does not hold, because of the non-identifiability problem associated with non-ignorable missingness.
- To avoid this problem, a reduced model is often used for the response mechanism:
$$\Pr(\delta = 1\mid x, y) = \Pr(\delta = 1\mid u, y),\qquad\text{where } x = (u, z).$$
The reduced response model introduces a smaller set of parameters, so the identification problem can be resolved. (More discussion appears in Chapter 6.)
- Computing the solution of $\bar S(\eta) = 0$ is also difficult. The EM algorithm, which will be discussed in Chapter 3, is a useful computational tool.

4. Observed information

4. Observed information

Definition
1. Observed score function: $S_{\mathrm{obs}}(\eta) = \frac{\partial}{\partial\eta}\log L_{\mathrm{obs}}(\eta)$
2. Fisher information from the observed likelihood: $I_{\mathrm{obs}}(\eta) = -\frac{\partial^2}{\partial\eta\,\partial\eta^T}\log L_{\mathrm{obs}}(\eta)$
3. Expected (Fisher) information from the observed likelihood: $\mathcal{I}_{\mathrm{obs}}(\eta) = E_\eta\{I_{\mathrm{obs}}(\eta)\}$

Theorem 2.6. Under regularity conditions,
$$E\{S_{\mathrm{obs}}(\eta)\} = 0 \quad\text{and}\quad V\{S_{\mathrm{obs}}(\eta)\} = \mathcal{I}_{\mathrm{obs}}(\eta),$$
where $\mathcal{I}_{\mathrm{obs}}(\eta) = E_\eta\{I_{\mathrm{obs}}(\eta)\}$ is the expected information from the observed likelihood.

4. Observed information

- Under missing data, the MLE $\hat\eta$ is the solution of $\bar S(\eta) = 0$.
- Under some regularity conditions, $\hat\eta$ converges in probability to $\eta_0$ and has asymptotic variance $\{\mathcal{I}_{\mathrm{obs}}(\eta_0)\}^{-1}$, with
$$\mathcal{I}_{\mathrm{obs}}(\eta) = E\left\{-\frac{\partial}{\partial\eta^T} S_{\mathrm{obs}}(\eta)\right\} = E\{S_{\mathrm{obs}}^{\otimes 2}(\eta)\} = E\{\bar S^{\otimes 2}(\eta)\},\qquad\text{where } B^{\otimes 2} = BB^T.$$
- For variance estimation of $\hat\eta$, one may use $\{I_{\mathrm{obs}}(\hat\eta)\}^{-1}$. In general, the observed information $I_{\mathrm{obs}}(\hat\eta)$ is preferred to the expected information $\mathcal{I}_{\mathrm{obs}}(\hat\eta)$ for variance estimation of $\hat\eta$.

Return to Example 2.3

- Observed log-likelihood:
$$\ln L_{\mathrm{obs}}(\theta) = \sum_{i=1}^n \delta_i\log(\theta) - \theta\sum_{i=1}^n y_i$$
- MLE for $\theta$: $\hat\theta = \sum_{i=1}^n\delta_i \big/ \sum_{i=1}^n y_i$
- Observed information: $I_{\mathrm{obs}}(\theta) = \sum_{i=1}^n\delta_i/\theta^2$
- Expected information: $\mathcal{I}_{\mathrm{obs}}(\theta) = \sum_{i=1}^n(1 - e^{-\theta c})/\theta^2 = n(1 - e^{-\theta c})/\theta^2$
- Which one do you prefer?
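A small numerical comparison of the two quantities (a sketch assuming numpy; the simulation settings repeat the illustrative values used in the sketch after Example 2.3):

import numpy as np

rng = np.random.default_rng(4)
theta0, c, n = 0.5, 2.0, 1000
t = rng.exponential(scale=1 / theta0, size=n)
delta = (t <= c).astype(float)
y = np.minimum(t, c)

theta_hat = delta.sum() / y.sum()
I_obs = delta.sum() / theta_hat ** 2                          # observed information at the MLE
I_exp = n * (1 - np.exp(-theta_hat * c)) / theta_hat ** 2     # expected information at the MLE
print(I_obs, I_exp)   # they differ because sum(delta) need not equal its expectation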

4. Observed information

Motivation
- $L_{\mathrm{com}}(\eta) = f(y,\delta;\eta)$: complete-sample likelihood with no missing data.
- Fisher information associated with $L_{\mathrm{com}}(\eta)$:
$$I_{\mathrm{com}}(\eta) = -\frac{\partial}{\partial\eta^T} S_{\mathrm{com}}(\eta) = -\frac{\partial^2}{\partial\eta\,\partial\eta^T}\log L_{\mathrm{com}}(\eta)$$
- $L_{\mathrm{obs}}(\eta)$: the observed likelihood. Fisher information associated with $L_{\mathrm{obs}}(\eta)$:
$$I_{\mathrm{obs}}(\eta) = -\frac{\partial}{\partial\eta^T} S_{\mathrm{obs}}(\eta) = -\frac{\partial^2}{\partial\eta\,\partial\eta^T}\log L_{\mathrm{obs}}(\eta)$$
- How can we express $I_{\mathrm{obs}}(\eta)$ in terms of $I_{\mathrm{com}}(\eta)$ and $S_{\mathrm{com}}(\eta)$?

Louis' formula

Theorem 2.7 (Louis, 1982; Oakes, 1999). Under regularity conditions allowing the exchange of the order of integration and differentiation,
$$I_{\mathrm{obs}}(\eta) = E\{I_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\} - \left[E\{S_{\mathrm{com}}^{\otimes 2}(\eta)\mid y_{\mathrm{obs}},\delta\} - \bar S^{\otimes 2}(\eta)\right]
= E\{I_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\} - V\{S_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\},$$
where $\bar S(\eta) = E\{S_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\}$.

Proof of Theorem 2.7

By Theorem 2.5, the observed information associated with $L_{\mathrm{obs}}(\eta)$ can be expressed as $I_{\mathrm{obs}}(\eta) = -\partial\bar S(\eta)/\partial\eta^T$, where $\bar S(\eta) = E\{S_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta;\eta\}$. Thus, we have
$$I_{\mathrm{obs}}(\eta) = -\frac{\partial}{\partial\eta^T}\int S_{\mathrm{com}}(\eta;y)\, f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta)\,d\mu(y)
= -\int\left\{\frac{\partial}{\partial\eta^T} S_{\mathrm{com}}(\eta;y)\right\} f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta)\,d\mu(y)
- \int S_{\mathrm{com}}(\eta;y)\left\{\frac{\partial}{\partial\eta^T} f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta)\right\} d\mu(y).$$
The first term is equal to $E\{I_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\}$. Writing $S_{\mathrm{mis}}(\eta) = \partial\log f(y,\delta\mid y_{\mathrm{obs}},\delta;\eta)/\partial\eta$, the second term is equal to
$$-E\{S_{\mathrm{com}}(\eta)\, S_{\mathrm{mis}}(\eta)^T\mid y_{\mathrm{obs}},\delta\}
= -E\left[\{\bar S(\eta) + S_{\mathrm{mis}}(\eta)\}\, S_{\mathrm{mis}}(\eta)^T\mid y_{\mathrm{obs}},\delta\right]
= -E\{S_{\mathrm{mis}}(\eta)\, S_{\mathrm{mis}}(\eta)^T\mid y_{\mathrm{obs}},\delta\},$$
because $E\{\bar S(\eta)\, S_{\mathrm{mis}}(\eta)^T\mid y_{\mathrm{obs}},\delta\} = 0$. Since $S_{\mathrm{com}}(\eta) = \bar S(\eta) + S_{\mathrm{mis}}(\eta)$ with $\bar S(\eta)$ fixed given $(y_{\mathrm{obs}},\delta)$, the last conditional expectation equals $V\{S_{\mathrm{com}}(\eta)\mid y_{\mathrm{obs}},\delta\}$, which gives the formula in Theorem 2.7.
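A Monte Carlo check of Louis' formula for the censored exponential model of Example 2.3 (a sketch assuming numpy; by memorylessness, a missing $t_i$ given $t_i > c$ is distributed as $c$ plus a fresh exponential draw; all settings are illustrative):

import numpy as np

rng = np.random.default_rng(5)
theta0, c, n, M = 0.5, 2.0, 500, 5000
t = rng.exponential(scale=1 / theta0, size=n)
delta = t <= c
y = np.minimum(t, c)
theta = delta.sum() / y.sum()                        # evaluate everything at the MLE

# Monte Carlo draws of the missing t_i given t_i > c
n_mis = (~delta).sum()
t_mis = c + rng.exponential(scale=1 / theta, size=(M, n_mis))

# complete-sample score S_com(theta) = n/theta - sum_i t_i, with the observed t_i held fixed
S_com = n / theta - (t[delta].sum() + t_mis.sum(axis=1))

I_com = n / theta ** 2                               # E{I_com | y_obs, delta}: non-random here
I_obs_louis = I_com - S_com.var()                    # Louis' formula
I_obs_direct = delta.sum() / theta ** 2              # curvature of the observed log-likelihood
print(I_obs_louis, I_obs_direct)                     # approximately equal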

Missing information principle

- $S_{\mathrm{mis}}(\eta)$: the score function of the conditional density $f(y,\delta\mid y_{\mathrm{obs}},\delta)$.
- Expected missing information: $\mathcal{I}_{\mathrm{mis}}(\eta) = E\{-\partial S_{\mathrm{mis}}(\eta)/\partial\eta^T\}$, satisfying $\mathcal{I}_{\mathrm{mis}}(\eta) = E\{S_{\mathrm{mis}}^{\otimes 2}(\eta)\}$.
- Missing information principle (Orchard and Woodbury, 1972):
$$\mathcal{I}_{\mathrm{mis}}(\eta) = \mathcal{I}_{\mathrm{com}}(\eta) - \mathcal{I}_{\mathrm{obs}}(\eta),$$
where $\mathcal{I}_{\mathrm{com}}(\eta) = E\{-\partial S_{\mathrm{com}}(\eta)/\partial\eta^T\}$ is the expected information from the complete-sample likelihood.
- An alternative expression of the missing information principle is
$$V\{S_{\mathrm{mis}}(\eta)\} = V\{S_{\mathrm{com}}(\eta)\} - V\{\bar S(\eta)\}.$$
Note that $V\{S_{\mathrm{com}}(\eta)\} = \mathcal{I}_{\mathrm{com}}(\eta)$ and $V\{S_{\mathrm{obs}}(\eta)\} = \mathcal{I}_{\mathrm{obs}}(\eta)$.

Example 2.7

1. Consider the following bivariate normal distribution:
$$\begin{pmatrix} y_{1i}\\ y_{2i}\end{pmatrix}\sim N\left[\begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix},\ \begin{pmatrix}\sigma_{11} & \sigma_{12}\\ \sigma_{12} & \sigma_{22}\end{pmatrix}\right],\qquad i = 1, 2, \dots, n.$$
Assume for simplicity that $\sigma_{11}$, $\sigma_{12}$ and $\sigma_{22}$ are known constants, and let $\mu = (\mu_1,\mu_2)'$ be the parameter of interest.
2. The complete-sample score function for $\mu$ is
$$S_{\mathrm{com}}(\mu) = \sum_{i=1}^n S_{\mathrm{com}}^{(i)}(\mu) = \sum_{i=1}^n\begin{pmatrix}\sigma_{11} & \sigma_{12}\\ \sigma_{12} & \sigma_{22}\end{pmatrix}^{-1}\begin{pmatrix} y_{1i}-\mu_1\\ y_{2i}-\mu_2\end{pmatrix}.$$
The information matrix of $\mu$ based on the complete sample is
$$I_{\mathrm{com}}(\mu) = n\begin{pmatrix}\sigma_{11} & \sigma_{12}\\ \sigma_{12} & \sigma_{22}\end{pmatrix}^{-1}.$$

Example 2.7 (cont'd)

3. Suppose that there are some missing values in $y_{1i}$ and $y_{2i}$, and the original sample is partitioned into four sets:
H = both $y_1$ and $y_2$ respond;
K = only $y_1$ is observed;
L = only $y_2$ is observed;
M = both $y_1$ and $y_2$ are missing.
Let $n_H, n_K, n_L, n_M$ denote the sizes of H, K, L, M, respectively.
4. Assume that the response mechanism does not depend on the value of $(y_1, y_2)$, so it is MAR. In this case, the observed score function of $\mu$ based on a single observation in set K is
$$E\{S_{\mathrm{com}}^{(i)}(\mu)\mid y_{1i}\} = \begin{pmatrix}\sigma_{11} & \sigma_{12}\\ \sigma_{12} & \sigma_{22}\end{pmatrix}^{-1}\begin{pmatrix} y_{1i}-\mu_1\\ E(y_{2i}\mid y_{1i})-\mu_2\end{pmatrix} = \begin{pmatrix}\sigma_{11}^{-1}(y_{1i}-\mu_1)\\ 0\end{pmatrix},\qquad i\in K.$$

Example 2.7 (cont'd)

5. Similarly, we have
$$E\{S_{\mathrm{com}}^{(i)}(\mu)\mid y_{2i}\} = \begin{pmatrix} 0\\ \sigma_{22}^{-1}(y_{2i}-\mu_2)\end{pmatrix},\qquad i\in L.$$
6. Therefore, the observed information matrix of $\mu$ is
$$I_{\mathrm{obs}}(\mu) = n_H\begin{pmatrix}\sigma_{11} & \sigma_{12}\\ \sigma_{12} & \sigma_{22}\end{pmatrix}^{-1} + n_K\begin{pmatrix}\sigma_{11}^{-1} & 0\\ 0 & 0\end{pmatrix} + n_L\begin{pmatrix}0 & 0\\ 0 & \sigma_{22}^{-1}\end{pmatrix},$$
and the asymptotic variance of the MLE of $\mu$ can be obtained as the inverse of $I_{\mathrm{obs}}(\mu)$.
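A direct numpy sketch of this observed information matrix (the covariance entries and the set sizes are arbitrary illustrative values):

import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                 # known (sigma11, sigma12; sigma12, sigma22)
nH, nK, nL, nM = 120, 40, 30, 10               # sizes of the sets H, K, L, M

I_obs = (nH * np.linalg.inv(Sigma)
         + nK * np.array([[1 / Sigma[0, 0], 0.0], [0.0, 0.0]])
         + nL * np.array([[0.0, 0.0], [0.0, 1 / Sigma[1, 1]]]))
# the set M contributes no information about mu
V_mle = np.linalg.inv(I_obs)                   # asymptotic variance of the MLE of mu
print(V_mle)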

REFERENCES

Fisher, R. A. (1922), 'On the mathematical foundations of theoretical statistics', Philosophical Transactions of the Royal Society of London A 222, 309-368.
Louis, T. A. (1982), 'Finding the observed information matrix when using the EM algorithm', Journal of the Royal Statistical Society: Series B 44, 226-233.
Oakes, D. (1999), 'Direct calculation of the information matrix via the EM algorithm', Journal of the Royal Statistical Society: Series B 61, 479-482.
Orchard, T. and Woodbury, M. A. (1972), 'A missing information principle: theory and applications', in Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, Berkeley, California, pp. 697-715.
Rubin, D. B. (1976), 'Inference and missing data', Biometrika 63, 581-592.