Semiparametric posterior limits for regular and some irregular problems

Bas Kleijn, KdV Institute, University of Amsterdam
Based on collaborations with P. Bickel and B. Knapik

Statistics Department, Seoul National University, Korea, 2012
Regular semiparametric estimation (Part I)

Partial linear regression
Consider an i.i.d. sample X_1, ..., X_n of the form X = (Y, U, V) ∈ R³, assumed to be related as
  Y = θU + η(V) + e,
where e ~ N(0,1) is independent of (U, V) ~ P, θ ∈ R, η ∈ H.

Question
Under which conditions (on H, P) can we estimate the parameter of interest θ (efficiently) in the presence of the nuisance parameter η?

Regularity
The density is suitably differentiable in θ, with non-singular Fisher information.
Irregular semiparametric estimation (Part II)

Model
Observe an i.i.d. sample X_1, ..., X_n ~ B(a,b)ⁿ, with B(a,b) a shifted/scaled β(1/2, 1/2), a ∈ R, b ∈ (0,∞).

Question
How do we estimate the location of B(a,b)?

Answer 1 (regular): X̄_n (Euler, after 1780)
Answer 2 (irregular): ½(X_(1) + X_(n)) (Bernoulli, 1777)

Difference
Rate of convergence n^{1/2} (regular) versus n^{2/3} (irregular).

Semiparametric version
Replace β by an unknown nuisance η (supported on [0,1], with specified boundary behaviour).
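The regular-versus-irregular contrast above can be illustrated numerically. The following Monte Carlo sketch (the shift/scale values, sample size, replicate count and seed are this example's own choices, not from the slides) estimates the centre a + b/2 of a shifted/scaled β(1/2, 1/2) sample both by the sample mean and by the midrange, and compares mean-square errors:

```python
import numpy as np

# Compare the sample mean with the midrange (X_(1)+X_(n))/2 as
# estimators of the centre of a shifted/scaled beta(1/2,1/2) sample.
# The midrange exploits the boundary behaviour of the density and
# achieves a much smaller MSE than the regular n^{-1/2}-rate mean.
rng = np.random.default_rng(4)
a, b, n, reps = 3.0, 2.0, 1000, 500
centre = a + b / 2

x = a + b * rng.beta(0.5, 0.5, size=(reps, n))
err_mean = x.mean(axis=1) - centre
err_mid = 0.5 * (x.min(axis=1) + x.max(axis=1)) - centre

mse_mean = np.mean(err_mean ** 2)  # regular estimator
mse_mid = np.mean(err_mid ** 2)    # irregular estimator
print(mse_mean, mse_mid)
```

The midrange wins decisively here, which is the point of the slide: exploiting the boundary behaviour of the density beats the central-limit rate.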
Part I
The semiparametric Bernstein-Von Mises theorem
Semiparametric inference

Frequentist semiparametric setup
Data P_0-i.i.d., model P = { P_{θ,η} : θ ∈ Θ, η ∈ H }, assume P_0 ∈ P.

A semiparametric Bernstein-Von Mises theorem asserts convergence of the θ-posterior to the efficient sampling distribution, in the presence of an infinite-dimensional nuisance η.

As such, the sBvM combines aspects of
- the parametric Bernstein-Von Mises theorem (Le Cam (1950s))
- nonparametric consistency (Schwartz (1965), Ghosal et al. (2001))
Stochastic Local Asymptotic Normality

Definition (Le Cam (1953))
There is a score ℓ̇_{θ0} ∈ L₂(P_{θ0}) with P_{θ0} ℓ̇_{θ0} = 0 such that for any (h_n) = O_{P_{θ0}}(1),
  ∏_{i=1}^n (p_{θ0+n^{-1/2}h_n}/p_{θ0})(X_i) = exp( h_nᵀ Δ_{n,θ0} − ½ h_nᵀ I_{θ0} h_n + o_{P_{θ0}}(1) ),
where Δ_{n,θ0} is given by
  Δ_{n,θ0} = n^{-1/2} ∑_{i=1}^n ℓ̇_{θ0}(X_i),
and I_{θ0} = P_{θ0} ℓ̇_{θ0} ℓ̇_{θ0}ᵀ is the Fisher information.

Efficiency (Fisher, Cramér, Rao, Le Cam, Hájek)
An estimator θ̂_n for θ0 is best-regular if and only if
  √n(θ̂_n − θ0) = Δ̄_{n,θ0} + o_{P0}(1), where Δ̄_{n,θ0} = I_{θ0}^{-1} Δ_{n,θ0}.
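The LAN expansion can be checked exactly in the simplest case. In the Gaussian location model N(θ, 1) (an illustrative choice of this sketch, not from the slides), the remainder term vanishes identically, so the log-likelihood ratio equals h Δ_{n,θ0} − h²/2 exactly, with I_{θ0} = 1:

```python
import numpy as np

# Exact LAN check for N(theta, 1): the log-likelihood ratio at
# theta_0 + h/sqrt(n) equals h*Delta_n - h^2/2 with no remainder,
# where Delta_n = n^{-1/2} * sum(X_i - theta_0) and I_{theta_0} = 1.
rng = np.random.default_rng(0)
n, theta0, h = 500, 1.0, 0.7
x = rng.normal(theta0, 1.0, size=n)

theta_n = theta0 + h / np.sqrt(n)
loglik_ratio = np.sum(-0.5 * (x - theta_n) ** 2 + 0.5 * (x - theta0) ** 2)

delta_n = np.sum(x - theta0) / np.sqrt(n)  # central sequence
lan_expansion = h * delta_n - 0.5 * h ** 2

print(loglik_ratio, lan_expansion)
```

For less smooth models the two sides agree only up to the o_{P_{θ0}}(1) term, which is what the definition allows.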
The parametric Bernstein-Von Mises theorem

Theorem 1. (Bernstein-Von Mises, h = √n(θ − θ0))
Let P = {P_θ : θ ∈ Θ ⊂ R^d} with thick prior Π_Θ be LAN at θ0 with non-singular I_{θ0}. Assume that for every sequence of radii M_n → ∞,
  Π( ‖h‖ ≤ M_n | X_1, ..., X_n ) → 1 in P0-probability.   (1)
Then the posterior converges to normality as follows:
  sup_B | Π( h ∈ B | X_1, ..., X_n ) − N_{Δ̄_{n,θ0}, I_{θ0}^{-1}}(B) | → 0 in P0-probability.   (2)

Another, more familiar form of the assertion:
  sup_B | Π( θ ∈ B | X_1, ..., X_n ) − N_{θ̂_n, (n I_{θ0})^{-1}}(B) | → 0 in P0-probability,
for any best-regular θ̂_n.
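A hedged illustration of Theorem 1 in a conjugate model (this example's choice, not part of the slides): Bernoulli(θ) data with a Beta(1,1) prior give a Beta(s+1, n−s+1) posterior, whose mean and spread match the MLE θ̂_n and (n I_{θ̂_n})^{-1/2}, with I_θ = 1/(θ(1−θ)), up to O(1/n) terms:

```python
import math

# Conjugate Bernoulli-Beta example: compare the exact posterior
# moments with the normal approximation N(theta_hat, (n*I)^{-1}).
n, s = 1000, 300                       # sample size, observed successes
a, b = s + 1, n - s + 1                # Beta(a, b) posterior parameters

post_mean = a / (a + b)
post_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

theta_hat = s / n                      # MLE
asymp_sd = math.sqrt(theta_hat * (1 - theta_hat) / n)  # (n*I)^{-1/2}

print(post_mean, theta_hat, post_sd, asymp_sd)
```

The agreement tightens as n grows, which is the content of assertion (2) in its familiar form.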
Posterior consistency and rates of convergence

Theorem 2. (Posterior consistency, Schwartz (1965))
Let P be a dominated model with metric d and prior Π. Let X_1, X_2, ... be i.i.d.-P_0 with P_0 ∈ P. Assume that the covering numbers are finite,
  N(ε, P, d) < ∞, for all ε > 0,
and that the prior mass of KL-neighbourhoods of P_0 is strictly positive,
  Π( P ∈ P : −P_0 log(p/p_0) ≤ ε ) > 0, for all ε > 0.
Then the posterior is consistent, i.e. for all ε > 0,
  Π( d(P, P_0) ≥ ε | X_1, ..., X_n ) → 0, P_0-a.s.

Stronger formulations exist for rates of convergence (Ghosal et al. (2001)).
Semiparametric Bernstein-Von Mises theorem — some definitions

With θ_n(h) = θ0 + h/√n and for given ρ > 0, M > 0, n ≥ 1,
  K_n(ρ, M) = { η ∈ H : sup_{‖h‖≤M} −P_0 log(p_{θ_n(h),η}/p_0) ≤ ρ², sup_{‖h‖≤M} P_0 ( log(p_{θ_n(h),η}/p_0) )² ≤ ρ² },
and K(ρ) = K_n(ρ, 0) (cf. Ghosal et al. (2001)).

U_n(r, h_n) relates to the uniform total-variation distance
  sup{ ‖P^n_{θ_n(h_n),η} − P^n_{θ0,η}‖_TV : η ∈ H, d_H(η, η0) < r }.
Semiparametric Bernstein-Von Mises theorem

Theorem 3. Equip Θ × H with the prior Π_Θ × Π_H. Suppose that Π_Θ is thick, that the model is sLAN and that the efficient Fisher information Ĩ_{θ0,η0} is non-singular. Also assume
(i) for all ρ > 0: Π_H(K(ρ)) > 0 and N(ρ, H, d_H) < ∞;
(ii) for all M > 0 there is an L > 0 such that for all ρ > 0: K(ρ) ⊂ K_n(Lρ, M) for large enough n;
and, for every bounded, stochastic (h_n),
(iii) there is an r > 0 such that U_n(r, h_n) = O(1);
(iv) sup_{η∈H} H(P_{θ_n(h_n),η}, P_{θ0,η}) = O(n^{-1/2});
and that the marginal θ-posterior contracts at parametric rate. Then the posterior satisfies the Bernstein-Von Mises assertion
  sup_B | Π( h ∈ B | X_1, ..., X_n ) − N_{Δ̃_{n,θ0,η0}, Ĩ_{θ0,η0}^{-1}}(B) | → 0 in P0-probability.
Partial linear regression

Observe an i.i.d.-P_0 sample X_1, X_2, ..., X_i = (U_i, V_i, Y_i), modelled by
  Y = θ0 U + η0(V) + e,
where e ~ N(0,1) is independent of (U, V) ~ P, with
  PU = 0, PU² = 1, PU⁴ < ∞, P(U − E[U|V])² > 0, P(U − E[U|V])⁴ < ∞.
For given α > 0, M > 0, define H_{α,M} = { η ∈ C^α[0,1] : ‖η‖_α < M }.

Theorem 4. Let α > 1/2 and M > 0 be given. Assume that η0 as well as v ↦ E[U|V = v] lie in H_{α,M}. Let Π_Θ be thick. Choose k > α − 1/2 and define Π^k_{α,M} to be the distribution of k times integrated Brownian motion started at random, conditioned on ‖η‖_α < M. Then
  sup_A | Π( h ∈ A | X_1, ..., X_n ) − N_{Δ̃_{n,θ0,η0}, Ĩ_{θ0,η0}^{-1}}(A) | → 0 in P0-probability,
where ℓ̃_{θ0,η0}(X) = e(U − E[U|V]) and Ĩ_{θ0,η0} = P(U − E[U|V])².
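The role of the efficient Fisher information P(U − E[U|V])² can be made concrete with a small Monte Carlo sketch. The joint law of (U, V) below is an assumed example (not from the slides): when U is correlated with V, the efficient information drops below the no-nuisance information PU² = 1.

```python
import numpy as np

# Assumed example: V ~ Uniform(0,1), U = c*(V - 1/2 + Z), Z ~ N(0,1),
# with c chosen so that PU^2 = 1.  Then E[U|V] = c*(V - 1/2) and the
# efficient information P(U - E[U|V])^2 equals c^2 = 12/13 < 1: part
# of the information about theta is lost to the nuisance eta.
rng = np.random.default_rng(1)
m = 200_000
c = np.sqrt(12 / 13)                   # Var(U) = c^2 * (1/12 + 1) = 1

v = rng.uniform(0, 1, size=m)
z = rng.normal(0, 1, size=m)
u = c * (v - 0.5 + z)

full_info = np.mean(u ** 2)                     # PU^2, approx 1
eff_info = np.mean((u - c * (v - 0.5)) ** 2)    # P(U - E[U|V])^2, approx 12/13
print(full_info, eff_info)
```

The gap between the two numbers is exactly the price of not knowing η0: the optimal asymptotic variance for θ is Ĩ^{-1}, not I^{-1}.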
Consistency under √n-perturbation — graph/heuristics

[Figure: the nuisance space H over Θ, with the least-favourable curve θ ↦ η*(θ) through (θ0, η0), a neighbourhood U0 of θ0 and the set D(θ, ρ) in H.]

The nuisance posterior conditional on θ concentrates around the least-favourable η*(θ).
Consistency under √n-perturbation — theorem

Based on the submodel θ ↦ η*(θ), define (for fixed θ and ρ > 0)
  D(θ, ρ) = { η ∈ H : H(P_{θ0,η}, P_{θ,η*(θ)}) < ρ }.

Theorem 5. (Consistency under √n-perturbation)
Assume that
(i) for every ρ > 0, Π_H(K(ρ)) > 0 and N(ρ, H, d_H) < ∞;
(ii) for all M > 0 there is an L > 0 such that for all ρ > 0: K(ρ) ⊂ K_n(Lρ, M) for large enough n;
(iii) for all bounded (h_n), sup_{η∈H} H(P_{θ_n(h_n),η}, P_{θ0,η}) = O(n^{-1/2}).
Then
  Π( D^c(θ, ρ_n) | θ = θ0 + n^{-1/2} h_n; X_1, ..., X_n ) = o_{P0}(1),
for all h_n = O_{P0}(1).
Integral local asymptotic normality — graph/heuristics

[Figure: the space H over Θ on a horizontal scale of order n^{-1/2}, with level sets ζ = const and the translates g_ζ of the least-favourable curve η* through (θ0, η0).]

Adaptive reparametrization around (θ0, η0): for η = η0 + ζ, consider (θ, ζ) ↦ (θ, η*(θ) + ζ).
Integral local asymptotic normality — theorem

In the following theorem, we describe the LAN expansion of
  h ↦ s_n(h) = ∫_H ∏_{i=1}^n (p_{θ0+n^{-1/2}h,η}/p_0)(X_i) dΠ_H(η),
assumed to be continuous.

Theorem 6. (Integral local asymptotic normality)
Suppose that the model is sLAN and that there is an r > 0 such that U_n(r, h_n) = O(1). Furthermore, assume that consistency under √n-perturbation obtains. Then, for every h_n = O_{P0}(1),
  log s_n(h_n) = log s_n(0) + h_nᵀ G_n ℓ̃_{θ0,η0} − ½ h_nᵀ Ĩ_{θ0,η0} h_n + o_{P0}(1).   (3)
Posterior asymptotic normality — analogy/heuristics

Parametric posterior: the posterior density
  θ ↦ dΠ(θ | X_1, ..., X_n) = ∏_{i=1}^n p_θ(X_i) dΠ(θ) / ∫_Θ ∏_{i=1}^n p_θ(X_i) dΠ(θ),
with a LAN requirement on the likelihood.

Semiparametric analog: the marginal posterior density
  θ ↦ dΠ(θ | X_1, ..., X_n) = ∫_H ∏_{i=1}^n p_{θ,η}(X_i) dΠ_H(η) dΠ_Θ(θ) / ∫_Θ ∫_H ∏_{i=1}^n p_{θ,η}(X_i) dΠ_H(η) dΠ_Θ(θ),
with an integral LAN requirement on the Π_H-integrated likelihood.

Then Le Cam's parametric proof stays intact!
Posterior asymptotic normality — theorem

Theorem 7. (Marginal posterior asymptotic normality)
Suppose that Π_Θ is thick and that h ↦ s_n(h) satisfies the ILAN property with non-singular Ĩ_{θ0,η0}. Assume that for every sequence of radii M_n → ∞,
  Π( ‖h‖ ≤ M_n | X_1, ..., X_n ) → 1 in P0-probability.
Then the marginal posterior for θ converges to normality as follows:
  sup_B | Π( h ∈ B | X_1, ..., X_n ) − N_{Δ̃_{n,θ0,η0}, Ĩ_{θ0,η0}^{-1}}(B) | → 0 in P0-probability.
Marginal convergence at rate √n

Condition for marginal posterior asymptotic normality: strips of the form
  { (θ, η) ∈ Θ × H : √n ‖θ − θ0‖ ≤ M_n }
receive posterior mass one asymptotically, for all M_n → ∞.

Lemma 8. (Marginal parametric rate (I))
Let h ↦ s_n(h) be ILAN. Assume there exists a constant C > 0 such that for any M_n → ∞,
  P₀ⁿ( sup_{η∈H} sup_{θ∈Θ_n} P_n log(p_{θ,η}/p_{θ0,η}) ≤ −C M_n²/n ) → 1,
where Θ_n = { θ ∈ Θ : √n ‖θ − θ0‖ > M_n }. Then
  Π( √n ‖θ − θ0‖ > M_n | X_1, ..., X_n ) → 0 in P0-probability,
for any M_n → ∞.
Marginal convergence at rate √n — continued

Theorem 9. (Marginal parametric rate (II))
Let Π_Θ and Π_H be given. Assume that there exists a sequence (H_n) of subsets of H such that the following two conditions hold:
(i) the nuisance posterior concentrates on H_n:
  Π( η ∈ H \ H_n | X_1, ..., X_n ) → 0 in P0-probability;
(ii) for every M_n → ∞,
  sup_{η∈H_n} P₀ⁿ Π( √n ‖θ − θ0‖ > M_n | η, X_1, ..., X_n ) → 0.
Then, for every M_n → ∞,
  Π( √n ‖θ − θ0‖ > M_n | X_1, ..., X_n ) → 0 in P0-probability.
Bias and marginal convergence at rate √n — a nasty subtlety

Misspecified parametric BvM (BK and van der Vaart (2012)): for every fixed η ∈ H, the conditional posterior
  θ ↦ dΠ( θ | η, X_1, ..., X_n )
contracts to θ*(η), the point of minimal KL divergence w.r.t. P_0. So unless
  sup_{η∈H_n} ‖θ*(η) − θ0‖ = o(n^{-1/2}),
an asymptotic bias ruins the BvM! (See also Castillo (2012).)

Under regularity conditions (van der Vaart (1998)), θ̂_n for θ0 is regular but asymptotically biased:
  √n(θ̂_n − θ0) = Δ̃_{n,θ0,η0} + sup_{η∈D} √n Ĩ_{θ0,η}^{-1} P_0 ℓ̃_{θ0,η} + o_{P0}(1).
Part II
Posterior limits in a class of irregular semiparametric problems
Stochastic Local Asymptotic Exponentiality

Definition
There exists an η̄ > 0 such that for any bounded, stochastic (h_n),
  ∏_{i=1}^n (p_{θ0+n^{-1}h_n}/p_{θ0})(X_i) = exp( h_n η̄ + o_{P_{θ0}}(1) ) 1{h_n ≤ Δ_n},
where Δ_n satisfies
  lim_n P_{θ0}ⁿ( Δ_n > u ) = e^{−η̄u}, for all u > 0
(Ibragimov and Has'minskii (1981)).

The definitions of K(ρ), K_n(ρ, M) and U_n are analogous to the LAN case.
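The LAE expansion can be checked exactly in the shifted exponential model p_θ(x) = λ e^{−λ(x−θ)} 1{x ≥ θ}, the canonical example in the spirit of Ibragimov and Has'minskii (an illustration of this sketch; λ plays the role of η̄ and the variable names are this example's own). The likelihood ratio at θ0 + h/n is exactly exp(hλ) 1{h ≤ Δ_n}, with Δ_n = n(X_(1) − θ0) itself exactly Exp(λ)-distributed:

```python
import numpy as np

# Exact LAE identity in the shifted exponential model: the log-likelihood
# ratio at theta_0 + h/n equals h*lam whenever h <= Delta_n, and the
# likelihood vanishes (log-ratio -inf) otherwise.
rng = np.random.default_rng(2)
n, theta0, lam, h = 1000, 0.0, 2.0, 0.01
x = theta0 + rng.exponential(1 / lam, size=n)

delta_n = n * (x.min() - theta0)       # exactly Exp(lam)-distributed
theta_n = theta0 + h / n

if h <= delta_n:                       # all observations exceed theta_n
    log_ratio = np.sum((np.log(lam) - lam * (x - theta_n))
                       - (np.log(lam) - lam * (x - theta0)))
else:                                  # some X_i < theta_n: ratio is zero
    log_ratio = -np.inf

print(log_ratio, h * lam, delta_n)
```

Here the o_{P_{θ0}}(1) term of the definition vanishes identically, just as the Gaussian model makes the LAN remainder vanish in the regular case.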
LAE Bernstein-Von Mises theorem

Theorem 10. Equip Θ × H with the prior Π_Θ × Π_H. Suppose that Π_Θ is thick and that the model is sLAE with η̄0 > 0. Also assume
(i) for all ρ > 0: Π_H(K(ρ)) > 0 and N(ρ, H, d_H) < ∞;
(ii) for all M > 0 there is an L > 0 such that for all ρ > 0: K(ρ) ⊂ K_n(Lρ, M) for large enough n;
and, for every bounded, stochastic (h_n),
(iii) there is an r > 0 such that U_n(r, h_n) = O(1);
(iv) sup_{η∈H} H(P_{θ_n(h_n),η}, P_{θ0,η}) = O(n^{-1});
and that the marginal θ-posterior contracts at rate 1/n. Then the posterior satisfies
  sup_B | Π( h ∈ B | X_1, ..., X_n ) − Exp_{Δ_n, η̄0}(B) | → 0 in P0-probability.
Estimation of domain boundary — problem

Definitions
Observe X_1, ..., X_n i.i.d.-P_{θ0,η0} with Lebesgue density
  p_{θ0,η0}(x) = η0(x − θ0), with η0(y) = 0 if y < 0,
and η̄0 = η0(0) > 0. Estimate θ0, with η0 ∈ H an unknown nuisance.

Model
Define L = C_S[0,∞] (continuous f : [0,∞] → R such that ‖f‖ ≤ S) and the map
  L → H : l ↦ η, η(x) = Z_l^{-1} exp( −αx + ∫_0^x l(t) dt ), (Z_l normalizes);
every η ∈ H is monotone decreasing, differentiable and log-Lipschitz.

Influence function
In this case, Δ_{n,θ0} = n(X_(1) − θ0).
Estimation of domain boundary — prior and BvM theorem

Lemma 11. Let S > 0, let W = {W_s : s ∈ [0,1]} be Brownian motion on [0,1], and let Z ~ N(0,1) be independent of W. Let Ψ : [0,∞] → [0,1], t ↦ (2/π) arctan(t). Define l ~ Π by
  l(t) = S Ψ( Z + W_{Ψ(t)} ).
Then C_S[0,∞] ⊂ supp(Π).

Theorem 12. Let X_1, ..., X_n be i.i.d.-P_0 and assume P_0 = P_{θ0,η0} lies in the model. Endow Θ = R with a prior thick at θ0 and H with the prior Π above. Then
  sup_B | Π( h ∈ B | X_1, ..., X_n ) − Exp_{Δ_{n,θ0}, η̄0}(B) | → 0 in P0-probability,
where Δ_{n,θ0} = n(X_(1) − θ0).
Estimation of domain boundaries — sub-optimality of MLE and Bayes estimates

Consider an i.i.d. sample X_1, ..., X_n from U[0, θ0]. In Le Cam (1990) (see also Ibragimov and Has'minskii (1981)) it is shown that θ̂_n = X_(n) is the MLE for θ0 and
  P^n_{θ0} n(θ̂_n − θ0)² = 2n/((n+1)(n+2)) θ0²,
whereas the estimator θ̃_n = ((n+2)/(n+1)) X_(n) leads to
  P^n_{θ0} n(θ̃_n − θ0)² = n/(n+1)² θ0²,
so that the relative efficiency of θ̂_n versus θ̃_n is
  P^n_{θ0}(θ̂_n − θ0)² / P^n_{θ0}(θ̃_n − θ0)² = 2(n+1)/(n+2) > 1 (!)
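The two exact risk formulas above can be verified by simulation. In the sketch below (seed, sample size and replicate count are arbitrary choices of this example), both estimators are evaluated on the same samples and their Monte Carlo mean-square errors are compared with the closed forms:

```python
import numpy as np

# Monte Carlo check of the U[0, theta_0] risk formulas: the MLE X_(n)
# has MSE 2*theta_0^2/((n+1)(n+2)), while the rescaled estimator
# (n+2)/(n+1) * X_(n) has MSE theta_0^2/(n+1)^2 -- about half as large.
rng = np.random.default_rng(3)
n, theta0, reps = 100, 1.0, 50_000

x_max = rng.uniform(0, theta0, size=(reps, n)).max(axis=1)
mse_mle = np.mean((x_max - theta0) ** 2)
mse_resc = np.mean(((n + 2) / (n + 1) * x_max - theta0) ** 2)

exact_mle = 2 * theta0 ** 2 / ((n + 1) * (n + 2))
exact_resc = theta0 ** 2 / (n + 1) ** 2

print(mse_mle / exact_mle, mse_resc / exact_resc)
```

Both ratios should be close to one, and mse_mle / mse_resc close to 2(n+1)/(n+2), i.e. nearly 2 for large n: the MLE is measurably inefficient in this irregular problem.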