Semiparametric posterior limits

Statistics Department, Seoul National University, Korea, 2012

Semiparametric posterior limits for regular and some irregular problems

Bas Kleijn, KdV Institute, University of Amsterdam
Based on collaborations with P. Bickel and B. Knapik

Regular semiparametric estimation (Part I): partial linear regression

Consider an i.i.d. sample $X_1, \ldots, X_n$ of the form $X = (Y, U, V) \in \mathbb{R}^3$, assumed to be related as
\[ Y = \theta U + \eta(V) + e, \]
where $e \sim N(0,1)$ independent of $(U, V) \sim P$, $\theta \in \mathbb{R}$, $\eta \in H$.

Question: under which conditions (on $H$, $P$) can we estimate the parameter of interest $\theta$ (efficiently) in the presence of the nuisance parameter $\eta$?

Regularity: the density is suitably differentiable in $\theta$ with non-singular Fisher information.
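
As a concrete illustration, here is a minimal simulation sketch (my addition, not from the slides) of a classical frequentist answer, Robinson-style partialling-out: estimate $E[Y \mid V]$ and $E[U \mid V]$ with a crude binned smoother, then run OLS on the residuals. The data-generating choices and the smoother are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n, theta0 = 5000, 1.5
    eta = lambda v: np.sin(2 * np.pi * v)            # smooth nuisance function

    V = rng.uniform(0.0, 1.0, n)
    U = 0.5 * V + rng.normal(size=n)                 # U correlated with V
    Y = theta0 * U + eta(V) + rng.normal(size=n)

    def cond_mean(x, v, bins=50):
        """Crude binned estimate of E[x | V = v]."""
        idx = np.clip((v * bins).astype(int), 0, bins - 1)
        means = np.array([x[idx == b].mean() for b in range(bins)])
        return means[idx]

    # Partial V out of both Y and U, then run OLS on the residuals.
    Y_res = Y - cond_mean(Y, V)
    U_res = U - cond_mean(U, V)
    theta_hat = (U_res @ Y_res) / (U_res @ U_res)
    print(theta_hat)    # consistent for theta0 at rate ~ n^{-1/2}

Whenever the conditional means are estimated well enough, $\hat\theta$ recovers $\theta_0$ at the parametric rate, which is the efficiency phenomenon the question on this slide is after.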

Irregular semiparametric estimation (Part II)

Model: observe an i.i.d. sample $X_1, \ldots, X_n \sim B(a,b)$, with $B(a,b)$ a shifted/scaled $\beta(3/2, 3/2)$ (the semicircular law), $a \in \mathbb{R}$, $b \in (0, \infty)$.

Question: how do we estimate the location of $B(a,b)$?
Answer 1 (regular): $\bar X_n$ (Euler, after 1780)
Answer 2 (irregular): $\tfrac12 (X_{(1)} + X_{(n)})$ (Bernoulli, 1777)
Difference: rate of convergence $n^{-1/2}$ (regular) versus $n^{-2/3}$ (irregular).

Semiparametric version: replace $\beta$ by an unknown nuisance $\eta$ (supported on $[0,1]$, with specified boundary behaviour).
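
A quick Monte Carlo sketch (my addition; the shift and replication counts are arbitrary choices) of the two rates: the midrange error shrinks relative to the sample-mean error like $n^{-1/6}$.

    import numpy as np

    rng = np.random.default_rng(1)
    a = 0.0                                  # location shift; the centre is a + 1/2
    for n in (100, 1000, 10000):
        reps = 500
        X = a + rng.beta(1.5, 1.5, size=(reps, n))
        err_mean = np.abs(X.mean(axis=1) - (a + 0.5)).mean()
        err_mid = np.abs(0.5 * (X.min(axis=1) + X.max(axis=1)) - (a + 0.5)).mean()
        print(n, err_mean, err_mid)          # err_mid / err_mean shrinks ~ n^{-1/6}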

Part I: The semiparametric Bernstein-Von Mises theorem

Semiparametric inference

Frequentist semiparametric setup: the data are $P_0$-i.i.d., the model is $\mathscr{P} = \{ P_{\theta,\eta} : \theta \in \Theta, \eta \in H \}$, and we assume $P_0 \in \mathscr{P}$.

A semiparametric Bernstein-Von Mises theorem asserts convergence of the $\theta$-posterior to the efficient sampling distribution in the presence of an infinite-dimensional nuisance $\eta$. As such, the sBvM combines aspects of
- the parametric Bernstein-Von Mises theorem (Le Cam (1950s)),
- nonparametric consistency (Schwartz (1965), Ghosal et al. (2001)).

Stochastic local asymptotic normality

Definition (Le Cam (1953)). There is an $\dot\ell_{\theta_0} \in L_2(P_{\theta_0})$ with $P_{\theta_0} \dot\ell_{\theta_0} = 0$ such that for any $(h_n) = O_{P_{\theta_0}}(1)$,
\[ \prod_{i=1}^n \frac{p_{\theta_0 + n^{-1/2} h_n}}{p_{\theta_0}}(X_i) = \exp\Big( h_n^T \Delta_{n,\theta_0} - \tfrac12 h_n^T I_{\theta_0} h_n + o_{P_{\theta_0}}(1) \Big), \]
where $\Delta_{n,\theta_0} = n^{-1/2} \sum_{i=1}^n \dot\ell_{\theta_0}(X_i)$ and $I_{\theta_0} = P_{\theta_0} \dot\ell_{\theta_0} \dot\ell_{\theta_0}^T$ is the Fisher information.

Efficiency (Fisher, Cramér, Rao, Le Cam, Hájek). An estimator $\hat\theta_n$ for $\theta_0$ is best-regular if and only if
\[ \sqrt{n}(\hat\theta_n - \theta_0) = \tilde\Delta_{n,\theta_0} + o_{P_0}(1), \qquad \text{where } \tilde\Delta_{n,\theta_0} = I_{\theta_0}^{-1} \Delta_{n,\theta_0}. \]
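
A textbook sanity check (my addition): in the normal location model $P_\theta = N(\theta, 1)$ the expansion holds exactly, with zero remainder,
\[ \log \prod_{i=1}^n \frac{p_{\theta_0 + h/\sqrt n}}{p_{\theta_0}}(X_i) = \frac{h}{\sqrt n} \sum_{i=1}^n (X_i - \theta_0) - \frac{h^2}{2} = h \, \Delta_{n,\theta_0} - \tfrac12 h^2 I_{\theta_0}, \]
so $\dot\ell_{\theta_0}(x) = x - \theta_0$, $I_{\theta_0} = 1$, and $\hat\theta_n = \bar X_n$ is best-regular since $\sqrt n(\bar X_n - \theta_0) = \tilde\Delta_{n,\theta_0}$ exactly.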

The parametric Bernstein-Von Mises theorem

Theorem 1 (Bernstein-Von Mises, $h = \sqrt n(\theta - \theta_0)$). Let $\mathscr P = \{ P_\theta : \theta \in \Theta \subset \mathbb{R}^d \}$ with thick prior $\Pi_\Theta$ be LAN at $\theta_0$ with non-singular $I_{\theta_0}$. Assume that for every sequence of radii $M_n \to \infty$,
\[ \Pi\big( \|h\| \le M_n \mid X_1, \ldots, X_n \big) \xrightarrow{P_0} 1. \]
Then the posterior converges to normality as follows:
\[ \sup_B \Big| \Pi\big( h \in B \mid X_1, \ldots, X_n \big) - N_{\tilde\Delta_{n,\theta_0}, I_{\theta_0}^{-1}}(B) \Big| \xrightarrow{P_0} 0. \]
Another, more familiar form of the assertion:
\[ \sup_B \Big| \Pi\big( \theta \in B \mid X_1, \ldots, X_n \big) - N_{\hat\theta_n, (n I_{\theta_0})^{-1}}(B) \Big| \xrightarrow{P_0} 0 \]
for any best-regular $\hat\theta_n$.
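
A numerical illustration (my addition; the exponential model and the Gamma(1,1) prior are illustrative choices) of Theorem 1: for $P_\theta = \mathrm{Exp}(\theta)$ the posterior is Gamma$(1+n, 1+\sum_i X_i)$ in closed form, and its total-variation distance to $N_{\hat\theta_n, (n I_{\hat\theta_n})^{-1}}$ with $I_\theta = \theta^{-2}$ shrinks as $n$ grows.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    theta0 = 2.0
    for n in (10, 100, 1000):
        X = rng.exponential(1.0 / theta0, n)
        post = stats.gamma(a=1 + n, scale=1.0 / (1.0 + X.sum()))  # conjugate posterior
        theta_hat = 1.0 / X.mean()                                # MLE; I_theta = theta^{-2}
        approx = stats.norm(theta_hat, theta_hat / np.sqrt(n))
        grid = np.linspace(theta_hat * (1 - 5 / np.sqrt(n)),
                           theta_hat * (1 + 5 / np.sqrt(n)), 2001)
        tv = 0.5 * np.abs(post.pdf(grid) - approx.pdf(grid)).sum() * (grid[1] - grid[0])
        print(n, tv)    # total-variation gap decreases with n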

Posterior consistency and rates of convergence

Theorem 2 (Posterior consistency, Schwartz (1965)). Let $\mathscr P$ be a dominated model with metric $d$ and prior $\Pi$. Let $X_1, X_2, \ldots$ be i.i.d.-$P_0$ with $P_0 \in \mathscr P$. Assume that the covering numbers are finite,
\[ N(\epsilon, \mathscr P, d) < \infty \quad \text{(for all } \epsilon > 0\text{)}, \]
and that the prior mass of KL-neighbourhoods of $P_0$ is strictly positive,
\[ \Pi\big( P \in \mathscr P : -P_0 \log(p/p_0) \le \epsilon \big) > 0 \quad \text{(for all } \epsilon > 0\text{)}. \]
Then the posterior is consistent, i.e. for all $\epsilon > 0$,
\[ \Pi\big( d(P, P_0) \ge \epsilon \mid X_1, \ldots, X_n \big) \xrightarrow{P_0\text{-a.s.}} 0. \]
There is a stronger formulation for rates of convergence (Ghosal et al. (2001)).

Semiparametric Bernstein-Von Mises theorem: some definitions

With $\theta_n(h) = \theta_0 + h/\sqrt n$ and for given $\rho > 0$, $M > 0$, $n \ge 1$,
\[ K_n(\rho, M) = \Big\{ \eta \in H : \sup_{\|h\| \le M} -P_0 \log\frac{p_{\theta_n(h),\eta}}{p_0} \le \rho^2, \; \sup_{\|h\| \le M} P_0 \Big( \log\frac{p_{\theta_n(h),\eta}}{p_0} \Big)^2 \le \rho^2 \Big\}, \]
and $K(\rho) = K_n(\rho, 0)$ (c.f. Ghosal et al. (2001)).

$U_n(r, h_n)$ relates to the uniform total-variation distance
\[ \sup\big\{ \big\| P^n_{\theta_n(h_n),\eta} - P^n_{\theta_0,\eta} \big\|_{TV} : \eta \in H, \; d_H(\eta, \eta_0) < r \big\}. \]

Semiparametric Bernstein-Von Mises theorem

Theorem 3. Equip $\Theta \times H$ with the prior $\Pi_\Theta \times \Pi_H$. Suppose that $\Pi_\Theta$ is thick, that the model is sLAN and that the efficient Fisher information $\tilde I_{\theta_0,\eta_0}$ is non-singular. Also assume
(i) for all $\rho > 0$: $\Pi_H(K(\rho)) > 0$ and $N(\rho, H, d_H) < \infty$;
(ii) for all $M > 0$ there is an $L > 0$ such that for all $\rho > 0$: $K(\rho) \subset K_n(L\rho, M)$ for large enough $n$;
and that for every bounded, stochastic $(h_n)$,
(iii) there is an $r > 0$ such that $U_n(r, h_n) = O(1)$;
(iv) $\sup_{\eta \in H} H(P_{\theta_n(h_n),\eta}, P_{\theta_0,\eta}) = O(n^{-1/2})$;
and that the marginal $\theta$-posterior contracts at parametric rate. Then the posterior satisfies the Bernstein-Von Mises assertion
\[ \sup_B \Big| \Pi\big( h \in B \mid X_1, \ldots, X_n \big) - N_{\tilde\Delta_n, \tilde I^{-1}_{\theta_0,\eta_0}}(B) \Big| \xrightarrow{P_0} 0. \]

Partial linear regression

Observe an i.i.d.-$P_0$ sample $X_1, X_2, \ldots$ with $X_i = (U_i, V_i, Y_i)$, modelled by
\[ Y = \theta_0 U + \eta_0(V) + e, \]
where $e \sim N(0,1)$ independent of $(U, V) \sim P$, and $PU = 0$, $PU^2 = 1$, $PU^4 < \infty$, $P(U - E[U \mid V])^2 > 0$, $P(U - E[U \mid V])^4 < \infty$.

For given $\alpha > 0$, $M > 0$, define $H_{\alpha,M} = \{ \eta \in C^\alpha[0,1] : \|\eta\|_\alpha < M \}$.

Theorem 4. Let $\alpha > 1/2$ and $M > 0$ be given. Assume that $\eta_0$ as well as $v \mapsto E[U \mid V = v]$ are in $H_{\alpha,M}$. Let $\Pi_\Theta$ be thick. Choose $k > \alpha - 1/2$ and define $\Pi^k_{\alpha,M}$ to be the distribution of $k$ times integrated Brownian motion started at random, conditioned on $\|\eta\|_\alpha < M$. Then
\[ \sup_A \Big| \Pi\big( h \in A \mid X_1, \ldots, X_n \big) - N_{\tilde\Delta_n, \tilde I^{-1}_{\theta_0,\eta_0}}(A) \Big| \xrightarrow{P_0} 0, \]
where $\tilde\ell_{\theta_0,\eta_0}(X) = e(U - E[U \mid V])$ and $\tilde I_{\theta_0,\eta_0} = P(U - E[U \mid V])^2$.
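
A short sketch (my addition) of a draw from the nuisance prior of Theorem 4, $k$ times integrated Brownian motion started at random; the conditioning on $\|\eta\|_\alpha < M$ could be imposed by rejection sampling and is omitted here.

    import numpy as np

    rng = np.random.default_rng(3)

    def integrated_bm_draw(k, grid):
        """One draw of k-times integrated Brownian motion 'started at random':
        a k-fold integral of BM plus a polynomial with N(0,1) coefficients."""
        dt = grid[1] - grid[0]
        path = np.cumsum(rng.normal(0.0, np.sqrt(dt), grid.size))  # BM on the grid
        for _ in range(k):
            path = np.cumsum(path) * dt                            # integrate once
        poly = sum(rng.normal() * grid ** j for j in range(k + 1)) # random start
        return path + poly

    grid = np.linspace(0.0, 1.0, 501)
    eta_draw = integrated_bm_draw(k=1, grid=grid)   # k = 1 suits alpha in (1/2, 3/2]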

Consistency under $\sqrt n$-perturbation: graph/heuristics

[Figure: the neighbourhood $D(\theta, \rho)$ in $\Theta \times H$ around the least-favourable curve $\eta^*(\theta)$ through $(\theta_0, \eta_0)$.]

The nuisance posterior conditional on $\theta$ concentrates around the least-favourable $\theta \mapsto \eta^*(\theta)$.

Consistency under $\sqrt n$-perturbation: theorem

Based on the submodel $\theta \mapsto \eta^*(\theta)$, define (for fixed $\theta$ and $\rho > 0$)
\[ D(\theta, \rho) = \{ \eta \in H : H(P_{\theta_0,\eta}, P_{\theta,\eta^*(\theta)}) < \rho \}. \]

Theorem 5 (Consistency under $\sqrt n$-perturbation). Assume that
(i) for every $\rho > 0$, $\Pi_H(K(\rho)) > 0$ and $N(\rho, H, d_H) < \infty$;
(ii) for all $M > 0$ there is an $L > 0$ such that for all $\rho > 0$: $K(\rho) \subset K_n(L\rho, M)$ for large enough $n$;
(iii) for all bounded $(h_n)$, $\sup_{\eta \in H} H(P_{\theta_n(h_n),\eta}, P_{\theta_0,\eta}) = O(n^{-1/2})$.
Then
\[ \Pi\big( D^c(\theta, \rho_n) \mid \theta = \theta_0 + n^{-1/2} h_n; \; X_1, \ldots, X_n \big) = o_{P_0}(1) \]
for all $h_n = O_{P_0}(1)$.

Integral local asymptotic normality: graph/heuristics

[Figure: the adaptive reparametrization in $\Theta \times H$ around $(\theta_0, \eta_0)$, with the least-favourable curve $\eta^*$ and level curves indexed by $\zeta$.]

Adaptive reparametrization around $(\theta_0, \eta_0)$: for $\eta = \eta_0 + \zeta$, consider $(\theta, \zeta) \mapsto (\theta, \eta^*(\theta) + \zeta)$.

Integral local asymptotic normality: theorem

In the following theorem, we describe the LAN expansion of
\[ h \mapsto s_n(h) = \int_H \prod_{i=1}^n \frac{p_{\theta_0 + n^{-1/2} h, \eta}}{p_0}(X_i) \, d\Pi_H(\eta), \]
assumed to be continuous.

Theorem 6 (Integral local asymptotic normality). Suppose that the model is sLAN and that there is an $r > 0$ such that $U_n(r, h_n) = O(1)$. Furthermore, assume that consistency under $\sqrt n$-perturbation obtains. Then, for every $h_n = O_{P_0}(1)$,
\[ \log s_n(h_n) = \log s_n(0) + h_n^T \mathbb{G}_n \tilde\ell_{\theta_0,\eta_0} - \tfrac12 h_n^T \tilde I_{\theta_0,\eta_0} h_n + o_{P_0}(1). \]

Posterior asymptotic normality: analogy/heuristics

Parametric posterior: the posterior density is
\[ \theta \mapsto d\Pi(\theta \mid X_1, \ldots, X_n) = \prod_{i=1}^n p_\theta(X_i) \, d\Pi(\theta) \Big/ \int_\Theta \prod_{i=1}^n p_\theta(X_i) \, d\Pi(\theta), \]
with the LAN requirement on the likelihood.

Semiparametric analog: the marginal posterior density is
\[ \theta \mapsto d\Pi(\theta \mid X_1, \ldots, X_n) = \int_H \prod_{i=1}^n p_{\theta,\eta}(X_i) \, d\Pi_H(\eta) \, d\Pi_\Theta(\theta) \Big/ \int_\Theta \int_H \prod_{i=1}^n p_{\theta,\eta}(X_i) \, d\Pi_H(\eta) \, d\Pi_\Theta(\theta), \]
with the integral LAN requirement on the $\Pi_H$-integrated likelihood. Then Le Cam's parametric proof stays intact!
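
A toy numerical sketch (my addition; the scalar nuisance and flat grid priors are simplifying assumptions) of the $\Pi_H$-integrated likelihood: the marginal $\theta$-posterior is obtained by integrating the nuisance out of the joint posterior.

    import numpy as np

    rng = np.random.default_rng(7)
    n, theta0, eta0 = 200, 1.0, 0.5
    U = rng.normal(size=n)
    Y = theta0 * U + eta0 + rng.normal(size=n)       # scalar nuisance intercept eta

    thetas = np.linspace(0.5, 1.5, 401)
    etas = np.linspace(-1.0, 2.0, 301)               # flat priors on both grids
    T, E = np.meshgrid(thetas, etas, indexing="ij")
    SY2, SUY, SY = (Y ** 2).sum(), (U * Y).sum(), Y.sum()
    SU2, SU = (U ** 2).sum(), U.sum()
    # log-likelihood of N(theta*U + eta, 1) via sufficient statistics
    loglik = -0.5 * (SY2 - 2 * T * SUY - 2 * E * SY
                     + T ** 2 * SU2 + 2 * T * E * SU + n * E ** 2)
    weights = np.exp(loglik - loglik.max())
    marginal = weights.sum(axis=1)                   # integrate eta out, as in s_n
    marginal /= marginal.sum() * (thetas[1] - thetas[0])
    # marginal is close to a normal density, as the theorems above predict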

Posterior asymptotic normality: theorem

Theorem 7 (Marginal posterior asymptotic normality). Suppose that $\Pi_\Theta$ is thick and that $h \mapsto s_n(h)$ satisfies the ILAN property with non-singular $\tilde I_{\theta_0,\eta_0}$. Assume that for every sequence of radii $M_n \to \infty$,
\[ \Pi\big( \|h\| \le M_n \mid X_1, \ldots, X_n \big) \xrightarrow{P_0} 1. \]
Then the marginal posterior for $\theta$ converges to normality as follows:
\[ \sup_B \Big| \Pi\big( h \in B \mid X_1, \ldots, X_n \big) - N_{\tilde\Delta_n, \tilde I^{-1}_{\theta_0,\eta_0}}(B) \Big| \xrightarrow{P_0} 0. \]

Marginal convergence at rate $\sqrt n$

Condition for marginal posterior asymptotic normality: strips of the form
\[ \Theta_n \times H = \big\{ (\theta, \eta) \in \Theta \times H : \sqrt n \, \|\theta - \theta_0\| \le M_n \big\} \]
receive posterior mass one asymptotically, for all $M_n \to \infty$.

Lemma 8 (Marginal parametric rate (I)). Let $h \mapsto s_n(h)$ be ILAN. Assume there exists a constant $C > 0$ such that for any $M_n \to \infty$,
\[ P^n_0 \Big( \sup_{\eta \in H} \sup_{\theta \notin \Theta_n} \mathbb{P}_n \log\frac{p_{\theta,\eta}}{p_{\theta_0,\eta}} \le -\frac{C M_n^2}{n} \Big) \to 1. \]
Then
\[ \Pi\big( n^{1/2} \|\theta - \theta_0\| > M_n \mid X_1, \ldots, X_n \big) \xrightarrow{P_0} 0 \]
for any $M_n \to \infty$.

Marginal convergence at rate $\sqrt n$, continued

Theorem 9 (Marginal parametric rate (II)). Let $\Pi_\Theta$ and $\Pi_H$ be given. Assume that there exists a sequence $(H_n)$ of subsets of $H$ such that the following two conditions hold:
(i) the nuisance posterior concentrates on $H_n$:
\[ \Pi\big( \eta \in H \setminus H_n \mid X_1, \ldots, X_n \big) \xrightarrow{P_0} 0; \]
(ii) for every $M_n \to \infty$,
\[ \sup_{\eta \in H_n} P^n_0 \, \Pi\big( n^{1/2} \|\theta - \theta_0\| > M_n \mid \eta, X_1, \ldots, X_n \big) \to 0. \]
Then for every $M_n \to \infty$,
\[ \Pi\big( n^{1/2} \|\theta - \theta_0\| > M_n \mid X_1, \ldots, X_n \big) \xrightarrow{P_0} 0. \]

Bias and marginal convergence at rate $\sqrt n$: a nasty subtlety

Misspecified parametric BvM (BK and van der Vaart (2012)): for every fixed $\eta \in H$, the conditional posterior
\[ d\Pi(\theta \mid \eta, X_1, \ldots, X_n) \]
contracts to $\theta^*(\eta)$, the point of minimal KL divergence with respect to $P_0$. So unless
\[ \sup_{\eta \in H_n} \|\theta^*(\eta) - \theta_0\| = o(n^{-1/2}), \]
an asymptotic bias ruins the BvM! (See also Castillo (2012).)

Under regularity conditions (van der Vaart (1998)), $\hat\theta_n$ for $\theta_0$ is regular but asymptotically biased:
\[ \sqrt n (\hat\theta_n - \theta_0) = \tilde\Delta_{n,\theta_0,\eta_0} + \sup_{\eta \in D} \tilde I^{-1}_{\theta_0,\eta} \, \sqrt n \, P_0 \tilde\ell_{\theta_0,\eta} + o_{P_0}(1). \]

Part II: Posterior limits in a class of irregular semiparametric problems

Stochastic local asymptotic exponentiality

Definition. There exists an $\bar\eta > 0$ such that for any bounded, stochastic $(h_n)$,
\[ \prod_{i=1}^n \frac{p_{\theta_0 + n^{-1} h_n}}{p_{\theta_0}}(X_i) = \exp\big( h_n \bar\eta + o_{P_{\theta_0}}(1) \big) \, \mathbb{1}\{ h_n \le \Delta_n \}, \]
where $\Delta_n$ satisfies
\[ \lim_n P^n_{\theta_0}( \Delta_n > u ) = e^{-\bar\eta u}, \quad \text{for all } u > 0 \]
(Ibragimov and Has'minskii (1981)).

The definitions of $K(\rho)$, $K_n(\rho, M)$ and $U_n$ are analogous to the LAN case, now with $\theta_n(h) = \theta_0 + h/n$.
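
A standard worked example (my addition): the shifted exponential model $p_\theta(x) = e^{-(x - \theta)} \mathbb{1}\{x \ge \theta\}$ is sLAE with $\bar\eta = 1$, since
\[ \prod_{i=1}^n \frac{p_{\theta_0 + h/n}}{p_{\theta_0}}(X_i) = e^{h} \, \mathbb{1}\{ \theta_0 + h/n \le X_{(1)} \} = e^{h \cdot 1} \, \mathbb{1}\{ h \le \Delta_n \}, \qquad \Delta_n = n(X_{(1)} - \theta_0), \]
and $\Delta_n$ is exactly $\mathrm{Exp}(1)$-distributed for every $n$, so $P^n_{\theta_0}(\Delta_n > u) = e^{-u}$.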

LAE Bernstein-Von Mises theorem

Theorem 10. Equip $\Theta \times H$ with the prior $\Pi_\Theta \times \Pi_H$. Suppose that $\Pi_\Theta$ is thick and that the model is sLAE with $\bar\eta_0 > 0$. Also assume
(i) for all $\rho > 0$: $\Pi_H(K(\rho)) > 0$ and $N(\rho, H, d_H) < \infty$;
(ii) for all $M > 0$ there is an $L > 0$ such that for all $\rho > 0$: $K(\rho) \subset K_n(L\rho, M)$ for large enough $n$;
and that for every bounded, stochastic $(h_n)$,
(iii) there is an $r > 0$ such that $U_n(r, h_n) = O(1)$;
(iv) $\sup_{\eta \in H} H(P_{\theta_n(h_n),\eta}, P_{\theta_0,\eta}) = O(n^{-1})$;
and that the marginal $\theta$-posterior contracts at rate $1/n$. Then the posterior satisfies
\[ \sup_B \Big| \Pi\big( h \in B \mid X_1, \ldots, X_n \big) - \mathrm{Exp}_{\Delta_n, \bar\eta}(B) \Big| \xrightarrow{P_0} 0. \]

Problem: estimation of a domain boundary

Definitions. Observe $X_1, \ldots, X_n$ i.i.d.-$P_{\theta_0,\eta_0}$ with Lebesgue density
\[ p_{\theta_0,\eta_0}(x) = \eta_0(x - \theta_0), \qquad \eta_0(y) = 0 \text{ if } y < 0, \]
and $\bar\eta_0 = \eta_0(0) > 0$. Estimate $\theta_0$, with $\eta_0 \in H$ an unknown nuisance.

Model. Define $L = C_S[0, \infty]$ (continuous $f : [0, \infty] \to \mathbb{R}$ such that $\|f\|_\infty \le S$) and
\[ L \to H : l \mapsto \eta, \qquad \eta(x) = Z_l^{-1} \, e^{-\alpha x + \int_0^x l(t) \, dt} \quad (Z_l \text{ normalizes}); \]
every $\eta \in H$ is monotone decreasing, differentiable and log-Lipschitz.

Influence function: in this case, $\Delta_{n,\theta_0} = n(X_{(1)} - \theta_0)$.
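
A Monte Carlo sketch (my addition; $\eta_0 = \mathrm{Exp}(2)$ is an illustrative choice with $\bar\eta_0 = 2$) of the boundary statistic: $\Delta_{n,\theta_0} = n(X_{(1)} - \theta_0)$ is, for this choice even exactly, $\mathrm{Exp}(\bar\eta_0)$-distributed.

    import numpy as np

    rng = np.random.default_rng(4)
    theta0, n, reps = 1.0, 1000, 5000
    X = theta0 + rng.exponential(0.5, size=(reps, n))   # eta0 = Exp(2), etabar0 = 2
    Delta = n * (X.min(axis=1) - theta0)
    print(Delta.mean(), 1 / 2)                          # Exp(2) has mean 1/2
    print(np.quantile(Delta, 0.5), np.log(2) / 2)       # and median log(2)/2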

Estimation of a domain boundary: prior and BvM theorem

Lemma 11. Let $S > 0$, let $W = \{ W_s : s \in [0,1] \}$ be Brownian motion on $[0,1]$ and let $Z \sim N(0,1)$ be independent of $W$. Let $\Psi : [0, \infty] \to [0, 1]$, $t \mapsto (2/\pi) \arctan(t)$. Define $l \sim \Pi$ by
\[ l(t) = S \, \Psi\big( Z + W_{\Psi(t)} \big). \]
Then $C_S[0, \infty] \subset \mathrm{supp}(\Pi)$.

Theorem 12. Let $X_1, \ldots, X_n$ be i.i.d.-$P_0$ and assume that $P_0 = P_{\theta_0,\eta_0}$ lies in the model. Endow $\Theta = \mathbb{R}$ with a prior thick at $\theta_0$ and $H$ with the prior $\Pi$ above. Then
\[ \sup_B \Big| \Pi\big( h \in B \mid X_1, \ldots, X_n \big) - \mathrm{Exp}_{\Delta_{n,\theta_0}, \bar\eta_0}(B) \Big| \xrightarrow{P_0} 0, \]
where $\Delta_{n,\theta_0} = n(X_{(1)} - \theta_0)$.
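
A sketch (my addition; the constant $\alpha$ and the grids are assumptions, with $\alpha > S$ keeping $\eta$ monotone decreasing) of a single draw $l \sim \Pi$ from Lemma 11, mapped to a nuisance density $\eta$ as on the previous slide:

    import numpy as np

    rng = np.random.default_rng(5)
    S, alpha = 1.0, 2.0                      # alpha > S so that eta is decreasing
    s_grid = np.linspace(0.0, 1.0, 1001)     # Brownian motion on [0, 1]
    W = np.concatenate([[0.0], np.cumsum(rng.normal(0, np.sqrt(np.diff(s_grid))))])
    Z = rng.normal()

    Psi = lambda t: (2.0 / np.pi) * np.arctan(t)
    x = np.linspace(0.0, 10.0, 2001)         # truncation of [0, inf) for display
    l = S * Psi(Z + np.interp(Psi(x), s_grid, W))          # l(t) = S Psi(Z + W_Psi(t))
    integral = np.concatenate([[0.0], np.cumsum(0.5 * (l[1:] + l[:-1]) * np.diff(x))])
    eta = np.exp(-alpha * x + integral)
    eta /= (0.5 * (eta[1:] + eta[:-1]) * np.diff(x)).sum() # normalize on the grid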

Estimation of domain boundaries: sub-optimality of the MLE versus Bayes-type estimates

Consider an i.i.d. sample $X_1, \ldots, X_n$ from $U[0, \theta_0]$. In Le Cam (1990) (see also Ibragimov and Has'minskii (1981)) it is shown that $\hat\theta_n = X_{(n)}$ is the MLE for $\theta_0$ and
\[ P^n_{\theta_0} \, n(\hat\theta_n - \theta_0)^2 = \frac{2n}{(n+1)(n+2)} \, \theta_0^2, \]
whereas the estimator $\tilde\theta_n = \frac{n+2}{n+1} X_{(n)}$ leads to
\[ P^n_{\theta_0} \, n(\tilde\theta_n - \theta_0)^2 = \frac{n}{(n+1)^2} \, \theta_0^2, \]
so that the relative efficiency of $\hat\theta_n$ versus $\tilde\theta_n$ is
\[ \frac{P^n_{\theta_0} (\hat\theta_n - \theta_0)^2}{P^n_{\theta_0} (\tilde\theta_n - \theta_0)^2} = \frac{2(n+1)}{n+2} \to 2 > 1 \; (!) \]
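
A quick numerical check (my addition) of the two risk formulas:

    import numpy as np

    rng = np.random.default_rng(6)
    theta0, n, reps = 1.0, 50, 100000
    X = rng.uniform(0, theta0, size=(reps, n))
    mle = X.max(axis=1)                             # MLE X_(n)
    rescaled = (n + 2) / (n + 1) * mle              # Bayes-type rescaling
    print(n * ((mle - theta0) ** 2).mean(),
          2 * n / ((n + 1) * (n + 2)))              # ~ 2n / ((n+1)(n+2))
    print(n * ((rescaled - theta0) ** 2).mean(),
          n / (n + 1) ** 2)                         # ~ n / (n+1)^2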