More on nuisance parameters

BS2 Statistical Inference, Lecture 3, Hilary Term 2009 January 30, 2009

Suppose that there is a minimal sufficient statistic T = t(X) partitioned as T = (S, C) = (s(X), c(X)) where:

C1: the distribution of C depends on λ but not on ψ;
C2: the conditional distribution of S given C = c depends on ψ but not on λ, for all c;
C3: the parameters vary independently, i.e. Θ = Ψ × Λ.

Then the likelihood function factorizes as

$$L(\theta; x) \propto f(s, c; \theta) = f(s \mid c; \psi)\, f(c; \lambda)$$

and we say that C is ancillary for ψ, S is conditionally sufficient for ψ given C, and C is marginally sufficient for λ. We also say that C is a cut for λ. We would then base inference about λ on the marginal distribution of C, and base inference about ψ on the conditional distribution of S given C = c.
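
A standard instance of such a cut (not from these notes, included for illustration): with C = N ~ Poisson(λ) and S | N = n ~ Binomial(n, ψ), conditions C1–C3 hold, so inference about λ uses the Poisson marginal of N and inference about ψ uses the binomial conditional of S given N. A minimal simulation sketch, with all numerical values assumed:

```python
# Minimal sketch (not from the lecture): the classic Poisson/binomial cut.
# N_i ~ Poisson(lambda) carries all information about lambda (C1),
# S_i | N_i = n ~ Binomial(n, psi) carries all information about psi (C2),
# and (psi, lambda) vary independently on (0,1) x (0,inf) (C3).
import numpy as np

rng = np.random.default_rng(0)
lam_true, psi_true, m = 4.0, 0.3, 10_000

N = rng.poisson(lam_true, size=m)    # marginal "cut" statistic C
S = rng.binomial(N, psi_true)        # conditional statistic S given C

lam_hat = N.mean()                   # MLE from the marginal Poisson factor
psi_hat = S.sum() / N.sum()          # MLE from the conditional binomial factor

print(f"lambda_hat = {lam_hat:.3f} (true {lam_true})")
print(f"psi_hat    = {psi_hat:.3f} (true {psi_true})")
```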

Consider a sample X = (X_1, ..., X_n) from a normal distribution N(µ, σ²) where both µ and σ² are unknown. Recall that

$$(U, V) = \Bigl(\bar X,\; S^2 = \sum_i (X_i - \bar X)^2\Bigr)$$

is minimal sufficient and the likelihood function is

$$L(\mu, \sigma^2; x) \propto f(u; \mu, \sigma^2)\, f(v; \sigma^2).$$

If we do straight maximum likelihood estimation, we have $\hat\mu = U = \bar X$ and $\hat\sigma^2 = V/n$. However, most statisticians agree that it is sensible to use $\tilde\sigma^2 = V/(n-1)$ as the estimator of σ². Is this reasonable, and is there a general rationale for this? Note that the common unbiasedness argument does not really work: $\tilde\sigma$ is not unbiased for the standard deviation σ, nor is $\tilde\sigma^{-2}$ unbiased for the precision $\sigma^{-2}$.
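
To see why the choice matters, a quick simulation (values assumed for illustration) of the bias of the two candidate estimators of σ²:

```python
# Quick check (illustration only): bias of V/n versus V/(n-1) for sigma^2
# in repeated N(mu, sigma^2) samples of size n.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n, reps = 2.0, 3.0, 5, 200_000

X = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
V = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

print("E[V/n]     ≈", (V / n).mean())        # close to sigma^2 * (n-1)/n = 2.4
print("E[V/(n-1)] ≈", (V / (n - 1)).mean())  # close to sigma^2 = 3.0
```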

This example shows that we have to be very careful when nuisance parameters are present, and that straight likelihood considerations can lead us astray: We wish to establish the precision of a new instrument which measures with normal errors. We therefore take repeated measurements of individuals, (X_{i1}, X_{i2}), i = 1, ..., n, which are all independent with X_{ij} ~ N(µ_i, σ²). Now consider U_i = (X_{i1} + X_{i2})/2 and V_i = (X_{i1} − X_{i2})/2. These are again independent and normally distributed as

$$U_i \sim N(\mu_i, \tau^2), \qquad V_i \sim N(0, \tau^2),$$

where τ² = σ²/2.

Clearly, we might as well consider (U_i, V_i) as the original data. Also, the pair (U, W) is minimal sufficient, where U = (U_1, ..., U_n) and W = Σ_i V_i², hence the likelihood function becomes

$$L(\mu, \tau^2) \propto (\tau^2)^{-n/2} e^{-\frac{1}{2\tau^2}\sum_i (u_i - \mu_i)^2}\,(\tau^2)^{-n/2} e^{-\frac{1}{2\tau^2}\sum_i v_i^2} = e^{-\frac{1}{2\tau^2}\sum_i (u_i - \mu_i)^2}\,(\tau^2)^{-n} e^{-\frac{w}{2\tau^2}}.$$

Thus the maximum likelihood estimator is $\hat\mu_i = U_i$, i = 1, ..., n, and $\hat\tau^2 = W/(2n)$. But $W \sim \tau^2\chi^2(n)$, so for large n, $\hat\tau^2 \approx n\tau^2/(2n) = \tau^2/2$!! So the additional parameters µ_i are a serious nuisance if τ² is the parameter of interest.
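
A quick simulation (values assumed for illustration; plain NumPy) makes the inconsistency visible: the full MLE stabilizes at τ²/2 no matter how large n gets.

```python
# Simulation of the paired-measurement example (illustration, assumed values):
# X_ij ~ N(mu_i, sigma^2), j = 1, 2, with tau^2 = sigma^2 / 2.
# The full MLE W/(2n) converges to tau^2 / 2, not tau^2.
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 2.0
tau2 = sigma2 / 2.0

for n in (10, 100, 10_000):
    mu = rng.normal(0.0, 5.0, size=n)               # arbitrary individual means
    X = rng.normal(mu[:, None], np.sqrt(sigma2), size=(n, 2))
    V = (X[:, 0] - X[:, 1]) / 2.0
    W = (V ** 2).sum()
    print(f"n = {n:6d}: MLE W/(2n) = {W / (2 * n):.3f}   (true tau^2 = {tau2})")
```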

The previous example shows that straight likelihood considerations may not lead to meaningful results when only a part of the parameter is considered. There are a number of suggestions for modifying the likelihood function to extract the evidence in the sample concerning a parameter of interest ψ when θ = (ψ, λ). Such modifications are generally known as pseudo-likelihood functions. Examples include: conditional likelihood, marginal likelihood, profile likelihood, integrated likelihood, and others, for example local, partial, restricted, residual, penalized, etc. The many names bear witness that straight likelihood considerations may not always be satisfactory.

Suppose we can write the joint density of a sufficient statistic T = (U, V) as f(u; λ, ψ) f(v | u; ψ), where ψ is the parameter of interest. Then, for fixed ψ, U is sufficient for λ. Inference for ψ can now be based on the conditional likelihood function

$$L(\psi; v \mid u) = f(v \mid u; \psi),$$

as the conditional distribution does not involve λ. The critical issue is whether (useful) information about ψ is lost by ignoring the factor f(u; λ, ψ).

In the normal example with many nuisance parameters, U = (U_i, i = 1, ..., n) is sufficient for the nuisance parameter λ = (µ_i, i = 1, ..., n) for fixed ψ = τ². Conditioning on U yields

$$L(\tau^2; w \mid u) = f(w \mid u; \tau^2) = f(w; \tau^2) \propto (\tau^2)^{-n/2} e^{-\frac{w}{2\tau^2}}.$$

This gives the conditional MLE $\hat\tau^2_u = W/n$, which is more sensible. It may be argued that U_i ~ N(µ_i, τ²) cannot possibly have information about τ², or at least that the information it may have is not useful.
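
A short check of this, using standard chi-square facts not spelled out on the slide: since $W \sim \tau^2\chi^2(n)$, the conditional log-likelihood and its maximizer are

$$\ell(\tau^2) = -\tfrac{n}{2}\log\tau^2 - \frac{w}{2\tau^2} + \text{const}, \qquad \frac{d\ell}{d\tau^2} = -\frac{n}{2\tau^2} + \frac{w}{2\tau^4} = 0 \;\Longrightarrow\; \hat\tau^2_u = \frac{w}{n},$$

and since E[W] = nτ², the conditional MLE is unbiased and, unlike W/(2n), consistent for τ².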

Marginal likelihood uses conditioning the other way around. Suppose we can write the joint density of a sufficient statistic T = (U, V) as f(u | v; λ, ψ) f(v; ψ), where ψ is the parameter of interest. Then the nuisance parameter λ can be eliminated by marginalization, as it does not enter the marginal distribution of V. Inference for ψ can now be based on the marginal likelihood function

$$L(\psi; v) = f(v; \psi).$$

The issue here is again whether (useful) information about ψ is lost by ignoring the factor f(u | v; λ, ψ).

In the normal example with many nuisance parameters, with λ = (µ_i, i = 1, ..., n) and ψ = τ², we get

$$L(\tau^2; w) = f(w; \tau^2) \propto (\tau^2)^{-n/2} e^{-\frac{w}{2\tau^2}},$$

which in this case is identical to the conditional likelihood function considered earlier, and hence $\hat\tau^2_w = W/n$. The marginal likelihood is in this case also known as the residual likelihood, because it is based on the residuals

$$R_{i1} = X_{i1} - \hat\mu_i = X_{i1} - \frac{X_{i1} + X_{i2}}{2} = \frac{X_{i1} - X_{i2}}{2} = V_i, \qquad R_{i2} = X_{i2} - \hat\mu_i = X_{i2} - \frac{X_{i1} + X_{i2}}{2} = -\frac{X_{i1} - X_{i2}}{2} = -V_i.$$

The corresponding estimates are then known as REML estimates.
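
As a numerical sanity check (values assumed; plain NumPy): the residual sum of squares here is 2W on 2n − n = n residual degrees of freedom, so the REML estimate of σ² is 2W/n and hence the estimate of τ² = σ²/2 is W/n.

```python
# Residual/REML check for the paired-measurement example (illustration only):
# fitting mu_i by the pair mean leaves residuals +V_i and -V_i, so the
# residual sum of squares is 2W on n degrees of freedom.
import numpy as np

rng = np.random.default_rng(3)
n, sigma2 = 5_000, 2.0
mu = rng.normal(0.0, 5.0, size=n)
X = rng.normal(mu[:, None], np.sqrt(sigma2), size=(n, 2))

resid = X - X.mean(axis=1, keepdims=True)   # R_i1 = V_i, R_i2 = -V_i
rss = (resid ** 2).sum()                    # equals 2 * W
sigma2_reml = rss / n                       # RSS / residual degrees of freedom
print("tau^2 (true)      =", sigma2 / 2)
print("tau^2 (REML, W/n) =", sigma2_reml / 2)
```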

Marginal and conditional likelihoods change the problem, either by ignoring some of the data (by marginalization) or by ignoring their variability (by conditioning). Profile likelihood attempts to stick to the original data distribution and likelihood function, but eliminates the nuisance parameters by maximization. The profile likelihood function $\hat L(\psi)$ for ψ is defined as

$$\hat L(\psi) = \sup_\lambda L(\psi, \lambda) = L\{\psi, \hat\lambda(\psi)\},$$

where ψ is the parameter of interest and $\hat\lambda(\psi)$ is the MLE of λ when ψ is considered fixed.
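
In practice the inner maximization is often done numerically. A minimal sketch (illustration only, assuming SciPy is available) on the earlier N(µ, σ²) sample, profiling out µ on a grid of σ² values:

```python
# Generic recipe for a profile likelihood: for each fixed psi = sigma^2 on a
# grid, maximize the log-likelihood over the nuisance parameter lambda = mu.
# Here the inner maximum is at mu_hat = x_bar, but the numeric maximization
# mimics what one would do when no closed form exists.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.normal(1.0, np.sqrt(2.0), size=50)

def loglik(mu, sigma2):
    return -0.5 * len(x) * np.log(sigma2) - ((x - mu) ** 2).sum() / (2 * sigma2)

sigma2_grid = np.linspace(0.5, 5.0, 200)
profile = [-minimize_scalar(lambda mu: -loglik(mu, s2)).fun for s2 in sigma2_grid]

# The grid maximizer should agree with V/n up to the grid resolution.
s2_hat = sigma2_grid[int(np.argmax(profile))]
print("profile MLE of sigma^2 ≈", round(s2_hat, 3))
print("V/n                    =", round(((x - x.mean()) ** 2).sum() / len(x), 3))
```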

Although the profile likelihood can generally be very useful, it does not help in the normal example with many nuisance parameters. With λ = (µ_i, i = 1, ..., n) and ψ = τ² we get

$$\hat L(\tau^2) = f(u; \hat\mu, \tau^2)\, f(w; \tau^2) \propto (\tau^2)^{-n} e^{-\frac{w}{2\tau^2}},$$

which hence also peaks in the wrong place, at $\hat\tau^2 = W/(2n)$. We shall later return to various attempts at modifying the profile likelihood.

Another way of removing nuisance parameters from the likelihood is to use integration. This method is essentially Bayesian and demands the specification of a prior distribution π(λ | ψ) of the nuisance parameter for fixed ψ. The integrated likelihood function is then defined as

$$L(\psi) = \int L(\psi, \lambda)\, \pi(\lambda \mid \psi)\, d\lambda.$$

The integrated likelihood has the same fundamental relation to the marginal prior and posterior distributions as the ordinary likelihood. For if π(ψ) is the prior on ψ, the full posterior distribution is determined as

$$\pi^*(\psi, \lambda) \propto \pi(\psi)\, \pi(\lambda \mid \psi)\, L(\psi, \lambda)$$

and thus, by integration,

$$\pi^*(\psi) \propto \int \pi^*(\psi, \lambda)\, d\lambda \propto \pi(\psi)\, L(\psi).$$

In the normal example with many nuisance parameters, we may for example consider the µ_i independent and normally distributed as µ_i ~ N(α, ω²), where (α, ω²) represent prior knowledge about the population from which the µ_i are taken. The integrated likelihood for τ² can then be calculated as

$$L(\tau^2) = f(w; \tau^2) \prod_i \int f(u_i; \mu_i, \tau^2)\, \pi(\mu_i; \alpha, \omega^2)\, d\mu_i.$$

Each integral can be recognized as the marginal density of U_i, where now the U_i are independent and identically distributed as N(α, τ² + ω²).
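
Spelling out that recognition step (a standard normal convolution, since U_i = µ_i + error with independent normal terms):

$$\int \frac{1}{\sqrt{2\pi\tau^2}}\, e^{-\frac{(u_i - \mu_i)^2}{2\tau^2}} \cdot \frac{1}{\sqrt{2\pi\omega^2}}\, e^{-\frac{(\mu_i - \alpha)^2}{2\omega^2}}\, d\mu_i = \frac{1}{\sqrt{2\pi(\tau^2 + \omega^2)}}\, e^{-\frac{(u_i - \alpha)^2}{2(\tau^2 + \omega^2)}}.$$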

Thus

$$L(\tau^2) \propto f(w; \tau^2)\,(\tau^2 + \omega^2)^{-n/2} e^{-\frac{1}{2(\tau^2 + \omega^2)}\sum_i (u_i - \alpha)^2} \propto (\tau^2)^{-n/2} e^{-\frac{w}{2\tau^2}}\,(\tau^2 + \omega^2)^{-n/2} e^{-\frac{Q_\alpha(u)}{2(\tau^2 + \omega^2)}},$$

where $Q_\alpha(U) = \sum_i (U_i - \alpha)^2$. In this calculation, ω² and α are known and fixed. If these are correct, in the sense that the µ_i do in fact behave as if they were i.i.d. N(α, ω²), then the integrated likelihood will peak around the correct value; otherwise the peak will be shifted to an incorrect position. So the influence of the prior prevails. Empirical Bayes, or, equivalently(!), MLE in the random-effects model, would also estimate α and ω² and get it right, as would hierarchical Bayes, assigning a prior on (α, ω²).
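
A small numerical illustration (all values assumed; plain NumPy) of the last two points: with the correct (α, ω²) the integrated likelihood peaks near the true τ², while a badly mis-specified ω² shifts the peak.

```python
# Numerical illustration (assumed values, not from the lecture) of how the
# integrated likelihood for tau^2 depends on the fixed prior (alpha, omega^2):
# with the correct prior it peaks near the true tau^2; with a badly wrong
# omega^2 the peak shifts.
import numpy as np

rng = np.random.default_rng(5)
n, tau2_true, alpha_true, omega2_true = 2_000, 1.0, 0.0, 4.0

mu = rng.normal(alpha_true, np.sqrt(omega2_true), size=n)
U = rng.normal(mu, np.sqrt(tau2_true))
V = rng.normal(0.0, np.sqrt(tau2_true), size=n)
W = (V ** 2).sum()

def integrated_loglik(tau2, alpha, omega2):
    q = ((U - alpha) ** 2).sum()
    return (-0.5 * n * np.log(tau2) - W / (2 * tau2)
            - 0.5 * n * np.log(tau2 + omega2) - q / (2 * (tau2 + omega2)))

grid = np.linspace(0.2, 5.0, 961)
for label, a, w2 in [("correct prior", 0.0, 4.0), ("wrong prior  ", 0.0, 0.01)]:
    peak = grid[int(np.argmax([integrated_loglik(t, a, w2) for t in grid]))]
    print(f"{label}: integrated likelihood peaks near tau^2 ≈ {peak:.2f}")
```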