Priors for the frequentist, consistency beyond Schwartz
1 Victoria University, Wellington, New Zealand, 11 January 2016

Priors for the frequentist, consistency beyond Schwartz

Bas Kleijn, KdV Institute for Mathematics
2 Part I Introduction
3 Bayesian and Frequentist statistics

sample space: (X, B), a measurable space; i.i.d. data X^n = (X_1, ..., X_n) ∈ X^n
model: (P, G), with model subsets B, V ∈ G
prior: Π : G → [0, 1], a probability measure
posterior: Π(· | X^n) : G → [0, 1], via Bayes's rule; basis for inference

Frequentist: assume there is a true P_0; X^n ~ P_0^n
Bayesian: assume P ~ Π; X^n | P ~ P^n
4 Definition of the posterior

Definition 4.1 Assume that all P ↦ P^n(A) are G-measurable. Given a prior Π, a posterior is any Π(· | X^n = ·) : G × X^n → [0, 1] such that

(i) for any G ∈ G, x^n ↦ Π(G | X^n = x^n) is B^n-measurable;
(ii) (disintegration) for all A ∈ B^n and G ∈ G,

  ∫_A Π(G | X^n) dP^Π_n = ∫_G P^n(A) dΠ(P),

where P^Π_n = ∫ P^n dΠ(P) is the prior predictive distribution.

Remark 4.2 For frequentists (X_1, ..., X_n) ~ P_0^n, so assume P_0^n ≪ P^Π_n.
5 Asymptotic consistency of the posterior

[Figure: posterior densities concentrating as n grows, panels for n = 0, 1, ..., 9, 16, ..., 36, 64, ..., 144, 256, ...]

Definition 5.1 Given a model P with a topology and a Borel prior Π, the posterior is consistent at P ∈ P if, for every open neighbourhood U of P,

  Π(U | X^n) → 1 in P-probability.
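For intuition, posterior consistency can be observed numerically in a toy setting. The sketch below (plain Python, a hypothetical illustration not taken from the talk) computes the posterior over a finite Bernoulli model by Bayes's rule and checks that it concentrates on a neighbourhood of the true parameter:

```python
import math

def posterior(model, prior, data):
    # Bayes's rule over a finite model: Pi(p | x^n) is proportional to
    # prior(p) * likelihood(p), normalised to a probability vector.
    weights = []
    for p, w in zip(model, prior):
        loglik = sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)
        weights.append(w * math.exp(loglik))
    total = sum(weights)
    return [w / total for w in weights]

# Finite Bernoulli model with uniform prior; the data stream has empirical
# frequency exactly 0.3, mimicking an i.i.d. sample from P_0 = Bernoulli(0.3).
model = [k / 10 for k in range(1, 10)]
prior = [1 / 9] * 9
data = ([1] * 3 + [0] * 7) * 20                  # n = 200 observations

post = posterior(model, prior, data)
mass_U = post[1] + post[2] + post[3]             # neighbourhood U = {0.2, 0.3, 0.4}
```

With n = 200 nearly all posterior mass sits on the neighbourhood U of p_0 = 0.3, in line with Definition 5.1; shrinking U and growing n reproduces the concentration pictured on the slide.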
6 Doob's and Schwartz's consistency theorems

Theorem 6.1 (Doob (1948)) Let P and X be Polish spaces. Assume that P ↦ P^n(A) is Borel measurable for all n, A. Then, for any prior Π, the posterior is consistent at P for Π-almost-all P ∈ P.

Remark 6.2 (Schwartz (1961), Freedman (1963)) Not frequentist!

Theorem 6.3 (Schwartz (1965)) Let X_1, X_2, ... be an i.i.d. sample from P_0 ∈ P. Let P be Hellinger totally bounded and let Π be a Kullback-Leibler (KL) prior, i.e.

  Π( P ∈ P : -P_0 log dP/dP_0 < ε ) > 0  for all ε > 0.

Then the posterior is consistent at P_0 in the Hellinger topology.
7 The Dirichlet process

Definition 7.1 (Dirichlet distribution) A random vector p = (p_1, ..., p_k) with p_l ≥ 0 and Σ_l p_l = 1 is Dirichlet distributed with parameter α = (α_1, ..., α_k), written p ~ D_α, if it has density

  f_α(p) = C(α) ∏_{l=1}^k p_l^{α_l - 1}.

Definition 7.2 (Dirichlet process, Ferguson (1973)) Let α be a finite measure on (X, B). The Dirichlet process P ~ D_α is defined by requiring, for all finite measurable partitions A = {A_1, ..., A_k} of X,

  ( P(A_1), ..., P(A_k) ) ~ D_{(α(A_1), ..., α(A_k))}.
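A Dirichlet vector can be simulated through the standard Gamma representation: Y_l ~ Gamma(α_l, 1) independently and p_l = Y_l / Σ_m Y_m gives p ~ D_α. A minimal stdlib-Python sketch (a hypothetical illustration, not part of the slides) checking the marginal means E p_l = α_l / |α|:

```python
import random

def sample_dirichlet(alpha, rng):
    # Gamma representation of the Dirichlet distribution:
    # Y_l ~ Gamma(alpha_l, 1) independent, p_l = Y_l / sum_m Y_m.
    ys = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(ys)
    return [y / s for y in ys]

rng = random.Random(42)
alpha = [2.0, 3.0, 5.0]
draws = [sample_dirichlet(alpha, rng) for _ in range(20000)]
means = [sum(d[l] for d in draws) / len(draws) for l in range(len(alpha))]
# E p_l = alpha_l / |alpha|, here (0.2, 0.3, 0.5)
```

The same Gamma construction underlies samplers for the Dirichlet process on finite partitions, via Definition 7.2.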
8 Weak consistency with Dirichlet priors

Theorem 8.1 (Dirichlet consistency) Let X_1, X_2, ... be an i.i.d. sample from P_0. If Π is a Dirichlet prior D_α with finite α such that supp(P_0) ⊂ supp(α), the posterior is consistent at P_0 in the weak model topology.

Remark 8.2 Priors are not necessarily KL priors for consistency.

Remark 8.3 (Freedman (1965)) Dirichlet distributions are tailfree: if A' refines A and A_{i,1} ∪ ... ∪ A_{i,l_i} = A_i, then

  ( P(A_{i,1} | A_i), ..., P(A_{i,l_i} | A_i) : 1 ≤ i ≤ k )

is independent of (P(A_1), ..., P(A_k)).

Remark 8.4 x^n ↦ Π(P(A) ∈ · | X^n) is σ_n(A)-measurable, where σ_n(A) is generated by products of the form ∩_{i=1}^n B_i with B_i = {X_i ∈ A} or B_i = {X_i ∉ A}.
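Conjugacy makes Theorem 8.1 tangible on a finite partition: given multinomial cell counts n_l, the D_α posterior for (P(A_1), ..., P(A_k)) is D_{α + n}, so the posterior mean (α_l + n_l)/(|α| + n) tracks the empirical frequencies. A small stdlib-Python sketch (hypothetical illustration; the partition and parameters are invented for the demo):

```python
import random

def dirichlet_posterior_mean(alpha, counts):
    # D_alpha prior + multinomial counts n -> D_{alpha + n} posterior;
    # posterior mean of P(A_l) is (alpha_l + n_l) / (|alpha| + n).
    a, n = sum(alpha), sum(counts)
    return [(alpha[l] + counts[l]) / (a + n) for l in range(len(alpha))]

rng = random.Random(7)
p0 = [0.5, 0.3, 0.2]                 # true cell probabilities of P_0
alpha = [1.0, 1.0, 1.0]              # prior base measure on the partition
counts = [0, 0, 0]
for _ in range(5000):                # i.i.d. draws from P_0, tallied per cell
    u, cum = rng.random(), 0.0
    for l, q in enumerate(p0):
        cum += q
        if u < cum:
            counts[l] += 1
            break

post_mean = dirichlet_posterior_mean(alpha, counts)
```

As n grows, the posterior mean of each P(A_l) converges to P_0(A_l): exactly the weak-topology concentration the theorem asserts, seen one partition at a time.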
9 Bayesian and Frequentist testability

Let B, V ∈ G be two (disjoint) model subsets.

Definition 9.1 (Uniform (or minimax) testability) There are tests (φ_n) with

  sup_{P ∈ B} P^n φ_n → 0,  sup_{Q ∈ V} Q^n (1 - φ_n) → 0.

Definition 9.2 (Pointwise testability) For all P ∈ B, Q ∈ V,

  φ_n → 0 P-a.s.,  φ_n → 1 Q-a.s.

Definition 9.3 (Bayesian testability) For Π-almost-all P ∈ B, Q ∈ V,

  φ_n → 0 P-a.s.,  φ_n → 1 Q-a.s.
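A concrete instance of Definition 9.1: for B = {Bernoulli(0.3)} and V = {Bernoulli(0.7)}, the threshold test φ_n = 1{Σ_i X_i ≥ n/2} is uniformly consistent, with both error probabilities vanishing exponentially fast (Hoeffding). The sketch below (stdlib Python, a hypothetical illustration) evaluates the exact binomial error probabilities:

```python
import math

def binom_tail(n, p, k):
    # Exact upper tail P(Bin(n, p) >= k).
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# phi_n = 1{ sum_i X_i >= n/2 }; errors of the test at sample size n:
#   type I  = P^n phi_n        for P = Bernoulli(0.3)  (the set B)
#   type II = Q^n (1 - phi_n)  for Q = Bernoulli(0.7)  (the set V)
errors = {}
for n in (10, 50, 200):
    k = math.ceil(n / 2)
    errors[n] = (binom_tail(n, 0.3, k), 1 - binom_tail(n, 0.7, k))
```

Both error sequences decrease to zero as n grows, so the pair (B, V) is uniformly testable; pointwise and Bayesian testability follow a fortiori.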
10 Part II Bayesian testability and prior-a.s.-consistency
11 A posterior concentration inequality

Lemma 11.1 Let (P, G) be given. For any prior Π, any test function φ and any B, V ∈ G such that B ∩ V = ∅ and Π(B) > 0,

  ∫_B P Π(V | X) dΠ(P) ≤ ∫_B P φ dΠ(P) + ∫_V Q (1 - φ) dΠ(Q).

Corollary 11.2 Consequently, in the i.i.d. context, for any sequences (Π_n), (B_n), (V_n) such that B_n ∩ V_n = ∅ and Π_n(B_n) > 0, we have

  ∫ P^n Π(V_n | X^n) dΠ_n(P | B_n) ≤ (1 / Π_n(B_n)) ( ∫_{B_n} P^n φ_n dΠ_n(P) + ∫_{V_n} Q^n (1 - φ_n) dΠ_n(Q) ).
12 Martingale convergence

Proposition 12.1 Let (P, G, Π) be given. For any B, V ∈ G, the following are equivalent:

(i) there exist Bayesian tests (φ_n) for B versus V;
(ii) there exist tests (φ_n) such that

  ∫_B P^n φ_n dΠ(P) → 0,  ∫_V Q^n (1 - φ_n) dΠ(Q) → 0;

(iii) for Π-almost-all P ∈ B, Q ∈ V,

  Π(V | X^n) → 0 P-a.s.,  Π(B | X^n) → 0 Q-a.s.

Remark 12.2 (Interpretation) Distinctions between model subsets are Bayesian testable, iff they are picked up by the posterior asymptotically, if(f) the Bayes factor for B versus V is consistent.
13 Prior-almost-sure consistency

Theorem 13.1 Let a Hausdorff model P with Borel prior Π be given. Assume that, for Π-almost-all P ∈ P and any open neighbourhood U of P, there exist a B ⊂ U with Π(B) > 0 and Bayesian tests (φ_n) for B versus P \ U. Then the posterior is consistent at Π-almost-all P ∈ P.

Remark 13.2 Let P be a Polish space and assume that all P ↦ P^n(A) are Borel measurable. Then, for any prior Π, any Borel set V ⊂ P is Bayesian testable versus P \ V.

Corollary 13.3 Doob's theorem (1948), and much more!
14 Part III Frequentism
15 Le Cam's inequality

Definition 15.1 For B ∈ G such that Π(B) > 0, the local prior predictive distribution is P^Π_{n|B} = ∫ P^n dΠ(P | B).

Remark 15.2 (Le Cam, unpublished (197?) and (1986)) Rewrite the posterior concentration inequality:

  P_0^n Π(V_n | X^n) ≤ ‖P_0^n - P^Π_{n|B_n}‖ + ∫ P^n φ_n dΠ(P | B_n) + (Π(V_n) / Π(B_n)) ∫ Q^n (1 - φ_n) dΠ(Q | V_n).

Remark 15.3 For some b_n ↓ 0, take B_n = {P ∈ P : ‖P^n - P_0^n‖ ≤ b_n}, with prior mass lower bound a_n ≤ Π(B_n).

Remark 15.4 Useful in parametric models, but "a considerable nuisance" (Le Cam (1986)) in the non-parametric context.
16 Schwartz's theorem revisited

Remark 16.1 Suppose that for all δ > 0 there is a B s.t. Π(B) > 0 and, for all P ∈ B and large enough n,

  P_0^n Π(V | X^n) ≤ e^{nδ} P^n Π(V | X^n);

then (by Fatou) for large enough m,

  sup_{n ≥ m} [ (P_0^n - e^{nδ} P^Π_{n|B}) Π(V | X^n) ] ≤ 0.

Theorem 16.2 Let P be a model with KL prior Π and P_0 ∈ P. Let B, V ∈ G be given and assume that B contains a KL neighbourhood of P_0. If there exist Bayesian tests for B versus V of exponential power, then

  Π(V | X^n) → 0, P_0-almost surely.

Corollary 16.3 Schwartz's theorem (1965).
17 Remote contiguity

Definition 17.1 Given sequences (P_n), (Q_n) of probability measures, Q_n is contiguous w.r.t. P_n (notation Q_n ◁ P_n) if, for any (ψ_n), ψ_n : X^n → [0, 1],

  P_n ψ_n = o(1)  implies  Q_n ψ_n = o(1).

Definition 17.2 Given (P_n), (Q_n) of probability measures and a_n ↓ 0, Q_n is a_n-remotely contiguous w.r.t. P_n (notation Q_n ◁ a_n^{-1} P_n) if, for any sequence (ψ_n), ψ_n : X^n → [0, 1],

  P_n ψ_n = o(a_n)  implies  Q_n ψ_n = o(1).

Remark 17.3 Contiguity is stronger than remote contiguity: note that Q_n ◁ P_n iff Q_n ◁ a_n^{-1} P_n for all a_n ↓ 0.

Definition 17.4 The Hellinger transform is ρ_n(α) = ∫ (dP_n)^α (dQ_n)^{1-α}.
18 Le Cam's first lemma

Lemma 18.1 Given (P_n), (Q_n) as above, Q_n ◁ P_n iff any of the following holds:

(i) if T_n → 0 in P_n-probability, then T_n → 0 in Q_n-probability;
(ii) given ε > 0, there is a b > 0 such that Q_n(dQ_n/dP_n > b) < ε;
(iii) given ε > 0, there is a c > 0 such that ‖Q_n - Q_n ∧ c P_n‖ < ε;
(iv) if dP_n/dQ_n → f Q_n-weakly along a subsequence, then P(f > 0) = 1;
(v) if dQ_n/dP_n → g P_n-weakly along a subsequence, then E g = 1;
(vi) lim inf_n ρ_n(α) → 1 as α ↑ 1.
19 Criteria for remote contiguity

Lemma 19.1 Given (P_n), (Q_n) and a_n ↓ 0, Q_n ◁ a_n^{-1} P_n if any of the following holds:

(i) if P_n(T_n > ε) = o(a_n) for all ε > 0, then T_n → 0 in Q_n-probability;
(ii) given ε > 0, there is a b > 0 such that Q_n(dQ_n/dP_n > b a_n^{-1}) < ε;
(iii) given ε > 0, there is a c > 0 such that ‖Q_n - Q_n ∧ c a_n^{-1} P_n‖ < ε;
(iv) if a_n^{-1} dP_n/dQ_n → f Q_n-weakly along a subsequence, then P(f > 0) = 1;
(v) if a_n dQ_n/dP_n → g P_n-weakly along a subsequence, then E g = 1;
(vi) lim sup_n inf_α a_n^{-α} ρ_n(α) = 0.
20 Beyond Schwartz

Theorem 20.1 Let (P, G) with priors (Π_n) and (X_1, ..., X_n) ~ P_0^n be given. Assume there are B_n, V_n ∈ G and a_n, b_n > 0 with a_n ↓ 0 s.t.

(i) there exist Bayesian tests for B_n versus V_n of power a_n,

  ∫_{B_n} P^n φ_n dΠ_n(P) + ∫_{V_n} Q^n (1 - φ_n) dΠ_n(Q) ≤ a_n;

(ii) the prior mass of B_n is lower-bounded by b_n, Π_n(B_n) ≥ b_n;

(iii) the sequence P_0^n satisfies P_0^n ◁ b_n a_n^{-1} P^Π_{n|B_n}.

Then Π_n(V_n | X^n) → 0 in P_0^n-probability.
21 Application to consistency I

Remark 21.1 (Schwartz (1965)) Take P_0 ∈ P, and define

  V_n = V := {P ∈ P : H(P, P_0) ≥ ε},
  B_n = B := {P : -P_0 log dP/dP_0 < ε²},

with a_n and b_n of the form exp(-nK). With N(ε, P, H) < ∞, the theorem proves Hellinger consistency with KL priors.

Remark 21.2 (Ghosal-Ghosh-van der Vaart (2000)) Take P_0 ∈ P, and define

  V_n = {P ∈ P : H(P, P_0) ≥ ε_n},
  B_n = {P : -P_0 log dP/dP_0 < ε_n², P_0 log² dP/dP_0 < ε_n²},

with a_n and b_n of the form exp(-K n ε_n²). With log N(ε_n, P, H) ≤ n ε_n², the theorem then proves Hellinger consistency at rate ε_n with GGV priors. Other choices of B_n are possible! (See Kleijn and Zhao (201x).)
22 Application to consistency II

Remark 22.1 Dirichlet posteriors x^n ↦ Π(P(A) ∈ · | X^n) are measurable w.r.t. σ_n(A), where σ_n(A) is generated by products of the form ∩_{i=1}^n B_i with B_i = {X_i ∈ A} or B_i = {X_i ∉ A}.

Remark 22.2 (Freedman (1965), Ferguson (1973), Lo (1984), ...) Take P_0 ∈ P, and define

  V_n = V := {P ∈ P : |(P_0 - P) f| ≥ 2ε},
  B_n = B := {P : |(P_0 - P) f| < ε},

for some bounded, measurable f. Impose remote contiguity only for ψ_n that are σ_n(A)-measurable! Take a_n and b_n of the form exp(-nK). The theorem then proves T_1 consistency with a Dirichlet prior D_α, if supp(P_0) ⊂ supp(α).
23 Stochastic Block Model

Definition 23.1 At step n, nodes belong to one of K_n unobserved classes: θ_i. We estimate θ = (θ_1, ..., θ_n) ∈ Θ_n upon observation of X^n = {X_ij : 1 ≤ i < j ≤ n}. Edges X_ij occur independently with probabilities Q_ij(θ) = Q(θ_i, θ_j). The (expected) degree is denoted λ_n.

[Figure: an SBM network realisation with n = 17, K_n = 3 and expected degree λ_n]
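A realisation like the one pictured can be generated directly from Definition 23.1: draw each edge X_ij independently with probability Q(θ_i, θ_j). A stdlib-Python sketch (hypothetical illustration; the two-class connectivity matrix below is an invented choice satisfying 0 < q < Q_ij < 1 - q):

```python
import random

def sample_sbm(theta, Q, rng):
    # Independent edge indicators X_ij ~ Bernoulli(Q(theta_i, theta_j)),
    # one per unordered pair 1 <= i < j <= n.
    n = len(theta)
    return {(i, j): int(rng.random() < Q[theta[i]][theta[j]])
            for i in range(n) for j in range(i + 1, n)}

rng = random.Random(0)
Q = [[0.6, 0.1],                     # assortative: within-class edges likelier
     [0.1, 0.6]]
theta = [0] * 10 + [1] * 10          # n = 20 nodes in K = 2 balanced classes
X = sample_sbm(theta, Q, rng)

within = sum(x for (i, j), x in X.items() if theta[i] == theta[j])
between = sum(x for (i, j), x in X.items() if theta[i] != theta[j])
```

The within/between edge counts make the block structure visible in a single draw, which is what the testing bounds on the next slide exploit through the separation Σ_{i<j} (Q_ij(θ) - Q_ij(θ'))².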
24 Testing in the Stochastic Block Model

Assume there is a q s.t. 0 < q < Q_ij < 1 - q < 1.

Lemma 24.1 For given θ, θ' ∈ Θ_n, there exists a test φ_n s.t.

  P_{θ,n} φ_n ∨ P_{θ',n} (1 - φ_n) ≤ e^{-8 q(1-q) Σ_{i<j} (Q_ij(θ) - Q_ij(θ'))²}.

Lemma 24.2 For given B_n, V_n ⊂ Θ_n, there exists a test φ_n s.t.

  sup_{θ ∈ B_n} P_{θ,n} φ_n ≤ e^{-8 q(1-q) a_n² + log #(V_n)},
  sup_{θ' ∈ V_n} P_{θ',n} (1 - φ_n) ≤ e^{-8 q(1-q) a_n² + log #(B_n)},

where a_n² = inf_{θ ∈ B_n} inf_{θ' ∈ V_n} Σ_{i<j} (Q_ij(θ) - Q_ij(θ'))².

Note: log #(V_n), log #(B_n) ≤ n log(K_n).
25 Consistency in the Stochastic Block Model

Theorem 25.1 (Bickel, Chen (2009)) If K_n = K and λ_n / log(n) → ∞, the ML estimator for θ is consistent. Consistency requires not a single mistake in θ = (θ_1, θ_2, ...).

Conjecture 25.2 Give Θ_n a prior Π_n s.t. Π_n({θ}) > 0 for all θ ∈ Θ_n. Unless very special conditions on q, K_n, λ_n are satisfied, the posterior is not consistent.

Theorem 25.3 (see also Choi, Wolfe, Airoldi (2011)) Given uniform priors Π_n on Θ_n, posteriors are consistent for hypotheses B_n, V_n ⊂ Θ_n if, for all n large enough,

  4 q_n (1 - q_n) a_n² ≥ n log(K_n).
26 Open questions

- Tailfreeness is too strong for (weak or T_1) consistency; the Dirichlet example allows generalization. Can we show that Pitman-Yor is inconsistent? What about inverse-Gaussian? Gibbs-type? Can we construct a family of consistent priors around Dirichlet?
- Which pairs of (FDR-type) hypotheses in the SBM are testable and which are not? Can the method be applied to other network models?
- What is the extent of the usefulness of remote contiguity?
27 The weak and T_1 topologies

Uniformity U_n: a basis is given by finite intersections of the sets

  W_{n,f} = {(P, Q) : |(P^n - Q^n) f| < ε},  (bounded measurable f : X^n → [0, 1]),

and U_∞ = ∪_n U_n ⊂ U_H. Weak uniformity: U_C ⊂ U_1, for bounded continuous f : X → [0, 1].

T_C ⊂ T_1 ⊂ T_n ⊂ T_∞ ⊂ T_H are completely regular.

(P, T_1) separable ⇐ (P, T_∞) separable ⇐ (P, T_H) separable ⇐ P is dominated.

Any model P is pre-compact in T_C, T_1, T_n, T_∞; any complete P is compact.

T_C is metrizable, even Polish if X is Polish. T_1, T_n, T_∞ are metrizable iff X is discrete.
Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real
More informationInformation Theory and Statistics, Part I
Information Theory and Statistics, Part I Information Theory 2013 Lecture 6 George Mathai May 16, 2013 Outline This lecture will cover Method of Types. Law of Large Numbers. Universal Source Coding. Large
More informationOn the Uniform Asymptotic Validity of Subsampling and the Bootstrap
On the Uniform Asymptotic Validity of Subsampling and the Bootstrap Joseph P. Romano Departments of Economics and Statistics Stanford University romano@stanford.edu Azeem M. Shaikh Department of Economics
More informationST5215: Advanced Statistical Theory
Department of Statistics & Applied Probability Wednesday, October 5, 2011 Lecture 13: Basic elements and notions in decision theory Basic elements X : a sample from a population P P Decision: an action
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d
More informationBandits : optimality in exponential families
Bandits : optimality in exponential families Odalric-Ambrym Maillard IHES, January 2016 Odalric-Ambrym Maillard Bandits 1 / 40 Introduction 1 Stochastic multi-armed bandits 2 Boundary crossing probabilities
More informationNonparametric one-sided testing for the mean and related extremum problems
Nonparametric one-sided testing for the mean and related extremum problems Norbert Gaffke University of Magdeburg, Faculty of Mathematics D-39016 Magdeburg, PF 4120, Germany E-mail: norbert.gaffke@mathematik.uni-magdeburg.de
More informationTranslation Invariant Experiments with Independent Increments
Translation Invariant Statistical Experiments with Independent Increments (joint work with Nino Kordzakhia and Alex Novikov Steklov Mathematical Institute St.Petersburg, June 10, 2013 Outline 1 Introduction
More informationThe Multi-Arm Bandit Framework
The Multi-Arm Bandit Framework A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course In This Lecture A. LAZARIC Reinforcement Learning Algorithms Oct 29th, 2013-2/94
More informationOn Reparametrization and the Gibbs Sampler
On Reparametrization and the Gibbs Sampler Jorge Carlos Román Department of Mathematics Vanderbilt University James P. Hobert Department of Statistics University of Florida March 2014 Brett Presnell Department
More informationCompressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery
Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery Jorge F. Silva and Eduardo Pavez Department of Electrical Engineering Information and Decision Systems Group Universidad
More informationStanford Mathematics Department Math 205A Lecture Supplement #4 Borel Regular & Radon Measures
2 1 Borel Regular Measures We now state and prove an important regularity property of Borel regular outer measures: Stanford Mathematics Department Math 205A Lecture Supplement #4 Borel Regular & Radon
More informationMinimax Estimation of Kernel Mean Embeddings
Minimax Estimation of Kernel Mean Embeddings Bharath K. Sriperumbudur Department of Statistics Pennsylvania State University Gatsby Computational Neuroscience Unit May 4, 2016 Collaborators Dr. Ilya Tolstikhin
More informationLecture 8: Information Theory and Statistics
Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and Estimation I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 22, 2015
More informationStatistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003
Statistical Approaches to Learning and Discovery Week 4: Decision Theory and Risk Minimization February 3, 2003 Recall From Last Time Bayesian expected loss is ρ(π, a) = E π [L(θ, a)] = L(θ, a) df π (θ)
More informationUseful Probability Theorems
Useful Probability Theorems Shiu-Tang Li Finished: March 23, 2013 Last updated: November 2, 2013 1 Convergence in distribution Theorem 1.1. TFAE: (i) µ n µ, µ n, µ are probability measures. (ii) F n (x)
More information6 Bayesian Methods for Function Estimation
hs25 v.2005/04/21 Prn:19/05/2005; 15:11 F:hs25013.tex; VTEX/VJ p. 1 1 Handbook of Statistics, Vol. 25 ISSN: 0169-7161 2005 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(05)25013-7 1 6 Bayesian
More informationP (A G) dp G P (A G)
First homework assignment. Due at 12:15 on 22 September 2016. Homework 1. We roll two dices. X is the result of one of them and Z the sum of the results. Find E [X Z. Homework 2. Let X be a r.v.. Assume
More informationPMR Learning as Inference
Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning
More informationBayesian nonparametrics
Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability
More information