Various types of likelihood


Various types of likelihood

1. likelihood, marginal likelihood, conditional likelihood, profile likelihood, adjusted profile likelihood
2. semi-parametric likelihood, partial likelihood
3. empirical likelihood, penalized likelihood
4. quasi-likelihood, composite likelihood
5. simulated likelihood, indirect inference
6. bootstrap likelihood, h-likelihood, weighted likelihood, pseudo-likelihood, local likelihood, sieve likelihood

November 6

- HW 2 comments & K-L divergence
- presentations
- semi-parametric likelihood as profile empirical likelihood
- composite likelihood

Exercises, October 23 (STA 4508S, Fall 2018)

1. The Kullback-Leibler divergence from the distribution G to the distribution F is given by

      KL(F : G) = ∫ log{f(y)/g(y)} f(y) dy,                                        (1)

   where f and g are density functions with respect to Lebesgue measure. Note that the divergence is not symmetric in its arguments. It is called the directed information distance in Barndorff-Nielsen and Cox (1994), where the more general definition KL(F : G) = ∫ log(dF/dG) dF is used, assuming F and G are mutually absolutely continuous.

   (a) In the canonical exponential family model with density f(s; φ) = exp{φᵀs − k(φ)} h(s), s ∈ ℝᵖ, find an expression for the KL divergence between the model with parameter φ₁ and that with parameter φ₂.

   (b) Show that for a sample of observations from a model with density f(y; θ), the maximum likelihood estimator minimizes the KL divergence from F(·; θ) to G_n(·), where G_n(·) is the empirical distribution function putting mass 1/n at each observation y_i.

2. Suppose y_i ∼ N(μ_i, 1/n), i = 1, ..., k, and ψ² = Σ_{i=1}^k μ_i² is the parameter of interest.¹

   (a) Show that the marginal posterior density for nψ², assuming a flat prior π(μ) ∝ 1, is a non-central χ²_k distribution with non-centrality parameter n Σ y_i².

   (b) Show that the maximum likelihood estimate of ψ² is ψ̂² = Σ y_i², and that nψ̂² has a non-central χ²_k distribution with non-centrality parameter nψ².

   (c) Compare the normal approximations to r_u(ψ), r_e(ψ) and r(ψ) with the exact distribution of the maximum likelihood estimate.

   (d) Compare the 95% Bayesian posterior probability interval for ψ², based on (a), to the 95% confidence interval for ψ², based on (b).

   ¹ It will be convenient to use λ_i = μ_i/(Σ μ_i²)^{1/2} for the nuisance parameters.
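Relating to exercise 1(a), a minimal numerical sketch (not part of the exercise set): it assumes the N(μ, 1) family written in canonical form, with φ = μ, s = y and k(φ) = φ²/2, and checks the exponential-family KL identity KL(F₁ : F₂) = (φ₁ − φ₂)k′(φ₁) − k(φ₁) + k(φ₂) against direct numerical integration of (1).

    # Check of the canonical exponential-family KL expression, N(mu, 1) example.
    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    phi1, phi2 = 0.3, 1.7
    k = lambda phi: 0.5 * phi ** 2          # cumulant function for N(mu, 1)
    kdot = lambda phi: phi                  # k'(phi) = E_phi(s)

    kl_closed = (phi1 - phi2) * kdot(phi1) - k(phi1) + k(phi2)

    # numerical version of  int log{f(y; phi1)/f(y; phi2)} f(y; phi1) dy
    integrand = lambda y: (norm.logpdf(y, phi1, 1) - norm.logpdf(y, phi2, 1)) * norm.pdf(y, phi1, 1)
    kl_numeric, _ = quad(integrand, -np.inf, np.inf)

    print(kl_closed, kl_numeric)            # both equal (phi1 - phi2)^2 / 2 = 0.98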

Proportional hazards regression

partial log-likelihood function, for distinct ordered failure times y_1 < y_2 < ... < y_n:

   ℓ_part(β; y, d) = Σ_{i=1}^n d_i { x_iᵀβ − log Σ_{j ∈ R_i} exp(x_jᵀβ) }

where R_i = {j : y_j ≥ y_i} is the set of individuals that could be observed to fail at time y_i; see SM 10.8 for treatment of ties

can be motivated as:
1. the marginal log-likelihood of the ranks of the failure times
2. the log of Π_{i=1}^n Pr(unit i fails at y_i | R_i, there is one failure at y_i)
3. the profile log-likelihood function if λ(·) is represented by a vector of values (λ_1, ..., λ_n) = {λ(y_1), ..., λ(y_n)}
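A small computational sketch of the partial log-likelihood above, assuming distinct (untied) event times; the toy times, indicators and covariate values are made up for illustration.

    import numpy as np

    def partial_loglik(beta, y, d, x):
        """y: event/censoring times, d: failure indicators (1 = failure),
        x: n x p covariate matrix, beta: p-vector."""
        eta = x @ beta
        ll = 0.0
        for i in range(len(y)):
            if d[i] == 1:
                risk = y >= y[i]                       # R_i = {j : y_j >= y_i}
                ll += eta[i] - np.log(np.sum(np.exp(eta[risk])))
        return ll

    y = np.array([2.0, 5.0, 6.0, 8.0, 11.0])
    d = np.array([1, 1, 0, 1, 1])
    x = np.array([[0.5], [1.2], [-0.3], [0.7], [2.0]])
    print(partial_loglik(np.array([0.4]), y, d, x))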

Inference

   ℓ_p(θ̃_n) = ℓ_p(θ_0) + (θ̃_n − θ_0)ᵀ Σ_{j=1}^n Ũ_j(θ_0) − ½ n (θ̃_n − θ_0)ᵀ ĩ(θ_0) (θ̃_n − θ_0) + o_p(√n ‖θ̃_n − θ_0‖ + 1)²

- Ũ is the projection of ∂ℓ/∂θ on the space spanned by the nuisance function
- as in parametric models, this leads to √n(θ̂ − θ_0) ∼ N{0, ĩ⁻¹(θ_0)} approximately, and to the likelihood ratio test 2{ℓ_p(θ̂) − ℓ_p(θ_0)} →d χ²_d
- proof uses least favourable sub-models through the true model
- effectively turns the infinite-dimensional problem into a finite-dimensional one

Infinite-dimensional models

- recall that L(θ; y) ∝ f(y; θ), with f(y; θ) a density with respect to a dominating measure
- more abstract definition: if a probability measure Q is absolutely continuous with respect to a probability measure P, and both possess densities with respect to a measure μ, then the likelihood of Q with respect to P is the Radon–Nikodym derivative dQ/dP = q/p, a.e. P
- some semi-parametric models have a dominating measure and a family of densities
- some can be handled by the notion of empirical likelihood
- some may use mixtures of these

... infinite-dimensional models

Definition: Given a measure P and a sample (y_1, ..., y_n), the empirical likelihood function is

   EL(P; y) = Π_{i=1}^n P({y_i}),

where P({y_i}) is the measure of the one-point set {y_i}.

Definition: Given a model P, a maximum likelihood estimator is the distribution P̂ that maximizes the empirical likelihood over the model; it may or may not exist.

Example: the empirical distribution                                    vdW 25.68

- P is the set of all probability distributions on a measurable space (Y, A)
- suppose the observed values y_1, ..., y_n are distinct
- since 1-point sets are measurable, {(P({y_1}), ..., P({y_n})): P ∈ P} = {(p_1, ..., p_n): p_i ≥ 0, Σ p_i ≤ 1}
- the empirical likelihood is maximized at (1/n, ..., 1/n)
- the empirical distribution function is the nonparametric MLE: F_n(·) = n⁻¹ Σ 1(Y_i ≤ ·)
- EL is not the same as Π f(y_i), even if P has a density f

Compare                                                                Owen, Ch. 2

- for y ∈ ℝ, define F(y) = Pr(Y ≤ y) and F(y−) = Pr(Y < y)
- for y_1, ..., y_n the nonparametric likelihood function is

     L(F) = Π_{i=1}^n {F(y_i) − F(y_i−)},

  hence 0 if F is continuous
- Theorem 2.1 of Owen: L(F) < L(F_n) for any F ≠ F_n, where F_n(y) = (1/n) Σ 1{y_i ≤ y}
- so there is a likelihood function on the space of distribution functions for which the empirical c.d.f. is the maximum likelihood estimator
- why does this fail for densities?

Aside: empirical likelihood                                            Owen, Ch. 2

- profile version of empirical likelihood:

     R(θ) = sup{ L(F)/L(F_n) : F a distribution with T(F) = θ }

  a relative likelihood, hence the factors n p_i
- example: T(F) = ∫ x dF(x), giving

     R(θ) = max{ Π_{i=1}^n n p_i : Σ_{i=1}^n p_i y_i = θ, p_i ≥ 0, Σ p_i = 1 }

- Theorem 2.2 of Owen: for y_1, ..., y_n i.i.d. F_0 with E(y_i) = θ_0 and var(y_i) < ∞,

     −2 log R(θ_0) →d χ²_1,

  attained at p̂_i = (1/n) · 1/{1 + α(y_i − θ_0)}, where α solves (1/n) Σ_{i=1}^n (y_i − θ_0)/{1 + α(y_i − θ_0)} = 0
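A minimal sketch of Theorem 2.2 in use: for a made-up sample, −2 log R(θ) is computed at a trial mean θ by solving the Lagrange-multiplier equation for α numerically (θ must lie strictly between min y_i and max y_i for all the weights to be positive).

    import numpy as np
    from scipy.optimize import brentq

    def neg2_log_R(theta, y):
        """-2 log R(theta) for the mean, with p_i = 1 / [n {1 + alpha (y_i - theta)}]."""
        n = len(y)
        z = y - theta
        # alpha solves (1/n) sum z_i / (1 + alpha z_i) = 0; the bracket keeps all p_i > 0
        lo = (-1 + 1e-10) / z.max()
        hi = (-1 + 1e-10) / z.min()
        g = lambda a: np.mean(z / (1 + a * z))
        alpha = brentq(g, lo, hi)
        p = 1 / (n * (1 + alpha * z))
        return -2 * np.sum(np.log(n * p))

    rng = np.random.default_rng(1)
    y = rng.exponential(scale=2.0, size=50)
    print(neg2_log_R(2.0, y))   # approximately chi-squared_1 when theta is the true mean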

Semi-parametric logistic regression                                    vdW Ex. 25.71

   Pr(Y = 1 | V, W) = e^{θv + η(w)} / {1 + e^{θv + η(w)}}

- sample (Y_i, V_i, W_i), i = 1, ..., n, independent:

     L(θ, η; Y) = Π_{i=1}^n { e^{θv_i + η(w_i)} / (1 + e^{θv_i + η(w_i)}) }^{y_i} { 1 / (1 + e^{θv_i + η(w_i)}) }^{1 − y_i}

- taking η(w_i) = +∞ when y_i = 1 and η(w_i) = −∞ when y_i = 0 drives L(θ, η) to its maximum, so unrestricted maximization over η is useless
- suggestion: penalized log-likelihood  log L(θ, η; Y) − α̂_n² ∫ {η^(k)(w)}² dw
- we can't simply maximize a likelihood; the penalized estimator needs a separate analysis of its properties
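A crude finite-dimensional sketch of the penalized idea (not the analysis referred to above): η(·) is represented by its values on a grid of w's, and a squared second-difference penalty stands in for ∫{η″(w)}² dw; the data-generating η and the penalty weight are made up.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit

    rng = np.random.default_rng(6)
    n, theta_true = 400, 1.0
    V = rng.normal(size=n)
    W = rng.uniform(0, 1, size=n)
    eta_true = np.sin(2 * np.pi * W)                        # made-up smooth eta
    Y = rng.binomial(1, expit(theta_true * V + eta_true))

    grid = np.linspace(0, 1, 21)
    cell = np.clip(np.searchsorted(grid, W), 0, len(grid) - 1)   # assign each W to a grid point

    def neg_pen_loglik(par, lam=5.0):
        theta, eta = par[0], par[1:]
        lp = theta * V + eta[cell]
        loglik = np.sum(Y * lp - np.log1p(np.exp(lp)))      # Bernoulli logistic log-likelihood
        penalty = lam * np.sum(np.diff(eta, n=2) ** 2)      # discrete roughness penalty
        return -(loglik - penalty)

    fit = minimize(neg_pen_loglik, x0=np.zeros(1 + len(grid)), method="BFGS")
    print(fit.x[0])                                         # estimate of theta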

Example: missing covariate                                             Murphy & vdW, 2000

- observation (D, W, Z); D and W are independent, given Z
- Pr(D = 0) = {1 + exp(γ + β e^z)}⁻¹,   W ∼ N(α_0 + α_1 z, σ²),   Z ∼ g(·), non-parametric
- Z is the gold standard covariate, e.g. LDL cholesterol; W is a surrogate for Z
- (d_C, w_C, z_C) is a complete observation; (d_R, w_R) has a missing covariate
- with x = (d_C, w_C, z_C, d_R, w_R):

     f(x; θ, g) = f(d_C, w_C | z_C; θ) g(z_C) ∫ f(d_R, w_R | z; θ) g(z) dz

     EL(θ, g) = f(d_C, w_C | z_C; θ) g{z_C} ∫ f(d_R, w_R | z; θ) g(z) dz

- θ = (γ, β, α_0, α_1, σ²); can be profiled, according to M & vdW

Various types of likelihood

1. likelihood, marginal likelihood, conditional likelihood, profile likelihood, adjusted profile likelihood
2. semi-parametric likelihood, partial likelihood
3. empirical likelihood, penalized likelihood
4. quasi-likelihood, composite likelihood
5. simulated likelihood, indirect inference
6. bootstrap likelihood, h-likelihood, weighted likelihood, pseudo-likelihood, local likelihood, sieve likelihood

Composite likelihood                                                   Lindsay, 1988

- vector observation: Y ∼ f(y; θ), with y taking values in a sample space 𝒴 ⊆ ℝᵐ, θ ∈ ℝᵈ
- set of events: {A_k, k ∈ K}
- composite log-likelihood:

     cℓ(θ; y) = Σ_{k ∈ K} w_k ℓ_k(θ; y),   ℓ_k(θ; y) = log f({y ∈ A_k}; θ),

  the log-likelihood for an event, with {w_k, k ∈ K} a set of weights
- also called: pseudo-likelihood (spatial modelling), quasi-likelihood (econometrics), limited information method (psychometrics)

Examples of composite log-likelihood

- Independence:            Σ_{r=1}^m w_r log f_1(y_r; θ)
- Pairwise:                Σ_{r=1}^m Σ_{s>r} w_{rs} log f_2(y_r, y_s; θ)
- Conditional:             Σ_{r=1}^m w_r log f(y_r | y_{(−r)}; θ)
- All pairs conditional:   Σ_{r=1}^m Σ_{s>r} w_{rs} log f(y_r | y_s; θ)
- Time series:             Σ_{r=1}^m w_r log f(y_r | y_{r−1}; θ)   (sketch below)
- Spatial:                 Σ_{r=1}^m w_r log f(y_r | neighbours of y_r; θ)
- small blocks of observations; pairwise differences; ... your favourite combination ...
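For the time-series entry, a small sketch with unit weights, assuming a Gaussian AR(1), y_r | y_{r−1} ∼ N(φ y_{r−1}, σ²); for a Markov model this conditional composite log-likelihood recovers the full log-likelihood apart from the contribution of y_1.

    import numpy as np
    from scipy.stats import norm

    def cl_timeseries(phi, sigma, y, w=None):
        """sum_r w_r log f(y_r | y_{r-1}; phi, sigma) for a Gaussian AR(1)."""
        w = np.ones(len(y) - 1) if w is None else w
        return np.sum(w * norm.logpdf(y[1:], loc=phi * y[:-1], scale=sigma))

    rng = np.random.default_rng(2)
    phi_true, sigma_true, n = 0.6, 1.0, 500
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi_true * y[t - 1] + rng.normal(scale=sigma_true)

    grid = np.linspace(0.3, 0.9, 61)
    print(grid[np.argmax([cl_timeseries(p, sigma_true, y) for p in grid])])   # crude maximizer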

Derived quantities

- single response y with density f(y; θ), y ∈ ℝᵐ, θ ∈ ℝᵈ
- composite log-likelihood   cℓ(θ; y) = Σ_k w_k ℓ_k(θ; y)
- composite score function   U_CL(θ) = ∂cℓ(θ; y)/∂θ
- sensitivity                H(θ) = E_θ{−∂²cℓ(θ; Y)/∂θ ∂θᵀ}
- variability                J(θ) = E_θ{U_CL(θ) U_CL(θ)ᵀ}
- Godambe information        G(θ) = H(θ) J⁻¹(θ) H(θ)

... derived quantities

- sample y = (y_1, ..., y_n) with joint density f(y; θ), y_i ∈ ℝᵐ, θ ∈ ℝᵈ
- score function             U_CL(θ) = ∂cℓ(θ; y)/∂θ = Σ_{i=1}^n ∂cℓ(θ; y_i)/∂θ
- maximum composite likelihood estimate    θ̂_CL = θ̂_CL(y) = arg sup_θ cℓ(θ; y)
- score equation             U_CL(θ̂_CL) = cℓ′(θ̂_CL) = 0
- composite LRT              w_CL(θ) = 2{cℓ(θ̂_CL) − cℓ(θ)}
- Godambe information        G(θ) = G_n(θ) = H_n(θ) J_n⁻¹(θ) H_n(θ) = O(n)
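A sketch of these quantities computed empirically for a scalar parameter: per-observation numerical derivatives of a pairwise log-likelihood (the exchangeable normal model of the example below) give H_n and J_n, and hence the Godambe information and a sandwich standard error. The sample size, dimension, true ρ and search bounds are arbitrary choices for the illustration.

    import numpy as np
    from scipy.stats import multivariate_normal
    from scipy.optimize import minimize_scalar

    def cl_i(rho, yi):
        """Pairwise log-likelihood contribution of one m-vector yi (unit weights)."""
        m = len(yi)
        cov = np.array([[1.0, rho], [rho, 1.0]])
        return sum(multivariate_normal.logpdf(yi[[r, s]], mean=[0, 0], cov=cov)
                   for r in range(m) for s in range(r + 1, m))

    def dcl(rho, yi, h=1e-5):    # numerical first derivative
        return (cl_i(rho + h, yi) - cl_i(rho - h, yi)) / (2 * h)

    def d2cl(rho, yi, h=1e-4):   # numerical second derivative
        return (cl_i(rho + h, yi) - 2 * cl_i(rho, yi) + cl_i(rho - h, yi)) / h ** 2

    rng = np.random.default_rng(4)
    n, m, rho_true = 300, 4, 0.3
    R = (1 - rho_true) * np.eye(m) + rho_true * np.ones((m, m))
    Y = rng.multivariate_normal(np.zeros(m), R, size=n)

    rho_hat = minimize_scalar(lambda r: -sum(cl_i(r, y) for y in Y),
                              bounds=(0.01, 0.95), method="bounded").x

    H_n = -sum(d2cl(rho_hat, y) for y in Y)                 # sensitivity (observed version)
    J_n = sum(dcl(rho_hat, y) ** 2 for y in Y)              # variability
    G_n = H_n ** 2 / J_n                                    # Godambe information (scalar case)
    print(rho_hat, np.sqrt(1 / G_n))                        # estimate and sandwich standard error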

Inference

- sample: Y_1, ..., Y_n i.i.d., cℓ(θ; y) = Σ_{i=1}^n cℓ(θ; y_i)
- θ̂_CL − θ ∼ N{0, G_n⁻¹(θ)} approximately, with G_n(θ) = H(θ) J(θ)⁻¹ H(θ)
- sketch, writing U = U_CL:

     0 = U(θ̂_CL) ≐ U(θ) + {∂_θᵀ U(θ)}(θ̂_CL − θ)
     θ̂_CL − θ ≐ −{∂_θᵀ U(θ)}⁻¹ U(θ) ≐ H⁻¹(θ) U(θ)
     U(θ) ∼ N{0, J(θ)},  so  H⁻¹(θ) U(θ) ∼ N{0, H⁻¹(θ) J(θ) H⁻ᵀ(θ)}   (approximately)

- conclude √n(θ̂_CL − θ) ∼ N{0, G⁻¹(θ)} approximately

... inference

- w(θ) = 2{cℓ(θ̂_CL) − cℓ(θ)} →d Σ_{a=1}^d μ_a Z_a²,  with Z_a ∼ N(0, 1) independent and μ_1, ..., μ_d the eigenvalues of J(θ) H(θ)⁻¹
- follows from cℓ(θ̂_CL) − cℓ(θ) ≐ ½ (θ̂_CL − θ)ᵀ{−cℓ″(θ̂_CL)}(θ̂_CL − θ); the limit is a weighted sum of independent χ²₁ variables, not χ²_d in general
- here J(θ) = var{U(θ)} and H(θ) = E_θ{−∂U(θ)/∂θᵀ}
- if J(θ) = H(θ), then w(θ) ∼ χ²_d approximately
- if d = 1, w(θ) ∼ μ_1 χ²_1 = J(θ)H⁻¹(θ) χ²_1 approximately, with H, J both scalars
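A short simulation of the limiting law: draws from Σ_a μ_a Z_a² are compared with the naive χ²_d quantile. The eigenvalues μ below are hypothetical numbers chosen for illustration, not computed from any model.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(5)
    mu = np.array([1.8, 1.2, 0.6])              # hypothetical eigenvalues of J(theta) H(theta)^{-1}
    Z = rng.standard_normal((100_000, len(mu)))
    w = (mu * Z ** 2).sum(axis=1)               # draws from sum_a mu_a Z_a^2

    print(np.quantile(w, 0.95))                 # calibration point for the weighted limit
    print(chi2.ppf(0.95, df=len(mu)))           # naive chi-squared_d quantile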

Example: symmetric normal

- Y_i ∼ N_m(0, R), with var(Y_ir) = 1 and corr(Y_ir, Y_is) = ρ
- compound bivariate normal densities to form the pairwise log-likelihood:

     cℓ(ρ; y_1, ..., y_n) = −{nm(m−1)/4} log(1 − ρ²) − {(m − 1 + ρ)/(2(1 − ρ²))} SS_w − {(m − 1)(1 − ρ)/(2(1 − ρ²))} SS_b

  where SS_w = Σ_{i=1}^n Σ_{s=1}^m (y_is − ȳ_i.)²  and  SS_b = Σ_{i=1}^n m ȳ_i.²
- full log-likelihood:

     ℓ(ρ; y_1, ..., y_n) = −{n(m−1)/2} log(1 − ρ) − (n/2) log{1 + (m−1)ρ} − SS_w / {2(1 − ρ)} − SS_b / [2{1 + (m−1)ρ}]
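A numerical sketch for this example, computing both objectives directly from normal log-densities (rather than from the closed forms above) and comparing the two maximizers of ρ over a grid; the simulated data and grid are arbitrary choices.

    import numpy as np
    from scipy.stats import multivariate_normal

    def equicorr(m, rho):
        return (1 - rho) * np.eye(m) + rho * np.ones((m, m))

    def full_loglik(rho, Y):
        m = Y.shape[1]
        return multivariate_normal.logpdf(Y, mean=np.zeros(m), cov=equicorr(m, rho)).sum()

    def pairwise_loglik(rho, Y):
        m = Y.shape[1]
        cov = equicorr(2, rho)
        return sum(multivariate_normal.logpdf(Y[:, [r, s]], mean=[0, 0], cov=cov).sum()
                   for r in range(m) for s in range(r + 1, m))

    rng = np.random.default_rng(3)
    n, m, rho_true = 200, 5, 0.5
    Y = rng.multivariate_normal(np.zeros(m), equicorr(m, rho_true), size=n)

    grid = np.linspace(0.05, 0.9, 86)
    print("full    :", grid[np.argmax([full_loglik(r, Y) for r in grid])])
    print("pairwise:", grid[np.argmax([pairwise_loglik(r, Y) for r in grid])])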

... symmetric normal

   a.var(ρ̂) = 2{1 + (m−1)ρ}²(1 − ρ)² / [ nm(m−1){1 + (m−1)ρ²} ]

   a.var(ρ̂_CL) = 2(1 − ρ)² c(m, ρ) / [ nm(m−1)(1 + ρ²)² ],

   c(m, ρ) = (1 − ρ)²(3ρ² + 1) + mρ(−3ρ³ + 8ρ² − 3ρ + 2) + m²ρ²(1 − ρ)²

both variances are O(1/n) as n → ∞ with m fixed; as m → ∞ with n fixed, a.var(ρ̂_CL) remains O(1)

... symmetric normal

[Figure: asymptotic efficiency a.var(ρ̂) / a.var(ρ̂_CL) plotted against ρ ∈ (0, 1), for m = 3, 5, 8, 10 (Cox & Reid, 2004); efficiencies lie between about 0.85 and 1.]

Likelihood ratio test

[Figure: four panels of log-likelihood curves plotted against ρ, for (ρ, n, q) = (0.5, 10, 5), (0.8, 10, 5), (0.2, 10, 5) and (0.2, 7, 5).]

Example: longitudinal count data                                       Henderson & Shimakura, 2003

- subjects i = 1, ..., n, with observed counts y_ir, r = 1, ..., m_i
- model: y_ir ∼ Poisson(u_ir x_irᵀβ), where u_i1, ..., u_im_i are gamma-distributed random effects, correlated: corr(u_ir, u_is) = ρ^{|r−s|}
- the joint density has a combinatorial number of terms in m_i: impractical
- weighted pairwise composite likelihood:

     L_pair(β) = Π_{i=1}^n Π_{r=1}^{m_i − 1} Π_{s=r+1}^{m_i} f(y_ir, y_is; β)^{1/(m_i − 1)}

- weights chosen so that L_pair equals the full likelihood when ρ = 0
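A structural sketch of the weighted pairwise objective only: the bivariate count density f(y_ir, y_is; β) is left as a stub (here, hypothetically, independent Poissons with a log link, which ignores the frailty correlation and is not the model above); the point is the 1/(m_i − 1) weighting over pairs.

    import numpy as np
    from scipy.stats import poisson

    def pair_logdensity(y_r, y_s, mu_r, mu_s):
        # placeholder: the actual example needs the joint density under correlated gamma frailties
        return poisson.logpmf(y_r, mu_r) + poisson.logpmf(y_s, mu_s)

    def weighted_pairwise_loglik(beta, ys, xs):
        """ys: list of count vectors y_i; xs: list of m_i x p covariate matrices."""
        total = 0.0
        for y_i, x_i in zip(ys, xs):
            m_i = len(y_i)
            w_i = 1.0 / (m_i - 1)               # weight 1/(m_i - 1), as above
            mu = np.exp(x_i @ beta)             # stub mean (log link), an assumption for the sketch
            for r in range(m_i):
                for s in range(r + 1, m_i):
                    total += w_i * pair_logdensity(y_i[r], y_i[s], mu[r], mu[s])
        return total

    rng = np.random.default_rng(7)
    ys = [rng.poisson(3.0, size=m) for m in (4, 6, 5)]
    xs = [np.column_stack([np.ones(len(y)), rng.normal(size=len(y))]) for y in ys]
    print(weighted_pairwise_loglik(np.array([1.0, 0.0]), ys, xs))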