Differential Stein operators for multivariate continuous distributions and applications
Gesine Reinert

A French/American Collaborative Colloquium on Concentration Inequalities, High Dimensional Statistics and Stein's Method, July 4th, 2017

Joint work with Guillaume Mijoule and Yvik Swan (Liège)
Outline

1. Stein's method
2. The score function and the Stein kernel
3. Higher dimensions
4. Stein operators T_p F = div(F p)/p
5. Last remarks
Stein's method in a nutshell

For \mu a target distribution with support I:

1. Find a suitable operator A (called a Stein operator) and a wide class of functions F(A) (called a Stein class) such that X \sim \mu if and only if E[A f(X)] = 0 for all functions f \in F(A).

2. Let H(I) be a measure-determining class on I. For each h \in H find a solution f = f_h \in F(A) of the Stein equation

  h(x) - E h(X) = A f(x),  where X \sim \mu.

If the solution exists and is unique in F(A), then we can write f(x) = A^{-1}(h(x) - E h(X)). We call A^{-1} the inverse Stein operator (for \mu).
Example: mean-zero normal

Stein (1972, 1986); see also Chen, Goldstein, Shao (2011). Z \sim N(0, \sigma^2) if and only if for all smooth functions f,

  E[Z f(Z)] = \sigma^2 E[f'(Z)].

Given a test function h, let Z \sim N(0, \sigma^2); the Stein equation is

  \sigma^2 f'(w) - w f(w) = h(w) - E h(Z),

which has as unique bounded solution

  f(y) = \frac{1}{\sigma^2} e^{y^2/(2\sigma^2)} \int_{-\infty}^{y} (h(x) - E h(Z)) e^{-x^2/(2\sigma^2)} dx.
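The characterising identity E[Z f(Z)] = \sigma^2 E[f'(Z)] is easy to check numerically. A minimal Monte Carlo sketch (not from the slides; the test function f and the value of \sigma are arbitrary choices):

```python
# Monte Carlo sanity check of E[Z f(Z)] = sigma^2 E[f'(Z)] for Z ~ N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5
z = rng.normal(0.0, sigma, size=2_000_000)

f = np.sin                      # a smooth, bounded test function
fprime = np.cos

lhs = np.mean(z * f(z))
rhs = sigma**2 * np.mean(fprime(z))
print(abs(lhs - rhs))           # small: Monte Carlo error only
```

Any smooth f with E|Z f(Z)| < \infty would do here; the identity holds exactly, so the printed discrepancy is pure sampling noise.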
Example: the sum of independent random variables

Let X_1, \ldots, X_n be independent with mean zero and Var(X_i) = 1/n; W = \sum_{i=1}^n X_i. Then

  E f'(W) - E[W f(W)]
    = E f'(W) - \sum_{i=1}^n E[X_i f(W)]
    = E f'(W) - \sum_{i=1}^n E[X_i f(W - X_i)] - \sum_{i=1}^n E[X_i^2 f'(W - X_i)] + R
    = \frac{1}{n} \sum_{i=1}^n ( E f'(W) - E f'(W - X_i) ) + R,

using that E[X_i f(W - X_i)] = 0 and E[X_i^2 f'(W - X_i)] = \frac{1}{n} E f'(W - X_i) by independence. Bound this expression by Taylor expansion to give that for any smooth h,

  |E h(W) - N h| \le \|h''\| \Big( \frac{2}{n} + E \sum_{i=1}^n |X_i|^3 \Big).

Note: nothing goes to infinity.
Comparison of distributions

Let X and Y have distributions \mu_X and \mu_Y with Stein operators A_X and A_Y, so that F(A_X) \cap F(A_Y) \neq \emptyset, and choose H(I) such that all solutions f of the Stein equation belong to this intersection. Then

  E h(X) - E h(Y) = E A_Y f(X) = E A_Y f(X) - E A_X f(X)

and

  \sup_{h \in H(I)} |E h(X) - E h(Y)| \le \sup_{f \in F(A_X) \cap F(A_Y)} |E A_X f(X) - E A_Y f(X)|.

If H(I) is the set of all Lipschitz-1 functions, then the resulting distance is d_W, the Wasserstein distance.

For examples: Holmes (2004), Eichelsbacher and R. (2008), Döbler (2012), Ley, Swan and R. (2015), ...
A Stein operator for continuous real-valued variables

Let X be continuous with pdf p and support I = [a, b] \subseteq R. The Stein class of X is the class F(p) of functions f: R \to R such that

(i) x \mapsto f(x) p(x) is differentiable on R,
(ii) (fp)' is integrable and \int (fp)' = 0.

To p we associate the Stein operator T_p:

  T_p f = \frac{(fp)'}{p}.

(Stein 1986, Stein et al. 2004, Ley and Swan 2013)

By the product rule,

  E[g'(X) f(X)] = -E[g(X) T_p f(X)]

for all f \in F(p) and all differentiable functions g such that \int (gfp)' dx = 0 and \int |g'| f p \, dx < \infty; we then say that g \in dom((\cdot)', p, f).
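This integration-by-parts identity can be checked by Monte Carlo for a concrete density. A sketch with X \sim Exp(1) (my own choice, not from the slides): there T_p f = (f p)'/p = f' - f, and taking f(x) = x makes the boundary term at 0 vanish; g(x) = e^{-x} is an arbitrary test function.

```python
# Check E[g'(X) f(X)] = -E[g(X) T_p f(X)] for X ~ Exp(1), T_p f = f' - f.
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(1.0, size=2_000_000)

f = lambda t: t
Tpf = lambda t: 1.0 - t          # (f p)'/p for p(t) = exp(-t), f(t) = t
g = lambda t: np.exp(-t)
gprime = lambda t: -np.exp(-t)

lhs = np.mean(gprime(x) * f(x))  # exact value: -1/4
rhs = -np.mean(g(x) * Tpf(x))    # exact value: -1/4
print(abs(lhs - rhs))
```

Both sides equal -1/4 in closed form here, which makes the agreement easy to verify by hand as well.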
Stein characterisations

Let Y be continuous with density q and the same support as X.

1. Suppose that q/p is differentiable. Take g \in \bigcap_{f \in F(p)} dom((\cdot)', p, f) such that g is p-a.s. never 0 and g q/p is differentiable. Then Y =_D X if and only if

  E[f(Y) g'(Y)] = -E[g(Y) T_p f(Y)]  for all f \in F(p).

2. Let f \in F(p) be p-a.s. never zero and assume that dom((\cdot)', p, f) is dense in L^1(p). Then Y =_D X if and only if

  E[f(Y) g'(Y)] = -E[g(Y) T_p f(Y)]  for all g \in dom((\cdot)', p, f).
The inverse Stein operator

Let F^{(0)}(p) be the class of mean-zero smooth test functions; the inverse Stein operator T_p^{-1}: F^{(0)}(p) \to F(p) is

  T_p^{-1} h(x) = \frac{1}{p(x)} \int_a^x p(y) h(y) dy = -\frac{1}{p(x)} \int_x^b p(y) h(y) dy.

The equation

  h(x) - E h(X) = f(x) g'(x) + g(x) T_p f(x),  x \in I,

is a Stein equation for the target p. Solutions of this equation (for h such that a solution exists) are pairs of functions (f, g) such that fg = T_p^{-1}(h - E_p h). Although fg is unique, the individual f and g are not.
Special Stein operators

Our general Stein operator is an operator on pairs of functions (f, g):

  A(f, g)(x) = T_p(fg)(x) = f(x) g'(x) + g(x) T_p f(x).

Suppose that 1 \in F(p). Then taking f(x) = 1 we get

  A_p g(x) = g'(x) + g(x) \rho(x),  with \rho(x) = T_p 1(x) = \frac{p'(x)}{p(x)}

the so-called score function of p; see for example Stein (2004).

If X has finite mean \nu, taking f(x) = T_p^{-1}(\nu - x) we get

  A_X g(x) = \tau(x) g'(x) + (\nu - x) g(x),  with \tau = T_p^{-1}(\nu - Id)

the Stein kernel of p; see Stein (1986) and Cacoullos et al. (1992).
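The Stein-kernel operator A_X g = \tau g' + (\nu - x) g has mean zero under p, which can be illustrated numerically. A sketch for X \sim Exp(1) (my example, not from the slides), where \nu = 1 and the Stein kernel is the standard \tau(x) = x; g is an arbitrary smooth choice:

```python
# The exponential(1) Stein kernel is tau(x) = x; check that
# E[tau(X) g'(X) + (nu - X) g(X)] = 0 with nu = E[X] = 1.
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(1.0, size=2_000_000)

g = lambda t: np.exp(-t)
gprime = lambda t: -np.exp(-t)

val = np.mean(x * gprime(x) + (1.0 - x) * g(x))
print(abs(val))   # close to 0
```

Swapping in other smooth, integrable g leaves the estimate near zero, consistent with the characterisation.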
Example: Normal

For a N(0, \sigma^2) random variable,

  T_N f(x) = f'(x) - \frac{1}{\sigma^2} x f(x),

which contrasts with \sigma^2 f'(x) - x f(x), the standard Stein operator for this case. The score function is -x/\sigma^2. The Stein kernel is \tau(x) = \sigma^2, giving the standard Stein operator.
Notation

Let e_1, \ldots, e_d be the canonical basis for Cartesian coordinates in R^d. The gradient of \phi: R^d \to R is

  \nabla \phi = \Big( \frac{\partial \phi}{\partial x_1}, \ldots, \frac{\partial \phi}{\partial x_d} \Big)^T = \sum_{i=1}^d (\partial_i \phi) e_i.

The gradient of a vector field v: R^d \to R^r, x \mapsto (v_1(x), v_2(x), \ldots, v_r(x)) (a row vector) is the matrix

  \nabla v = ( \nabla v_1 \; \nabla v_2 \; \cdots \; \nabla v_r ) = \Big( \frac{\partial v_j}{\partial x_i} \Big)_{1 \le i \le d, 1 \le j \le r}.

If r = d, then the divergence of v is

  div(v) = \nabla^T v = \sum_{i=1}^d \frac{\partial v_i}{\partial x_i} = Tr(\nabla v),

with Tr the trace operator and x \cdot y = x^T y = \langle x, y \rangle the Euclidean scalar product between x and y.
More generally, the divergence of a q \times d tensor field

  F: R^d \to R^{q \times d},  x \mapsto F(x) = \begin{pmatrix} F_1(x) \\ \vdots \\ F_q(x) \end{pmatrix} = \begin{pmatrix} F_{11}(x) & \cdots & F_{1d}(x) \\ \vdots & \ddots & \vdots \\ F_{q1}(x) & \cdots & F_{qd}(x) \end{pmatrix}

is

  div(F) := \nabla \cdot F = \begin{pmatrix} div(F_1) \\ \vdots \\ div(F_q) \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^d \partial F_{1i}/\partial x_i \\ \vdots \\ \sum_{i=1}^d \partial F_{qi}/\partial x_i \end{pmatrix}.

The divergence maps matrix-valued functions F: R^d \to R^{q \times d} onto vector-valued functions div(F): R^d \to R^q.
Product rule for divergence

Let F: R^d \to R^{q \times d} be a q \times d tensor field and \phi: R^d \to R. Then, under appropriate regularity conditions,

  div(F \phi) = div(F) \phi + F \nabla \phi.

Similarly, if F is a q \times d tensor field and G is a d \times d tensor field, then FG is a q \times d tensor field and

  (div(FG))_j = F_j \, div(G) + Tr( grad(F_j) \, G )

for j = 1, \ldots, q.
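The first product rule is simple to verify by finite differences at a point. An illustrative sketch with q = 1 (so F is a vector field on R^2); the particular F, \phi, and evaluation point are arbitrary choices of mine:

```python
# Finite-difference check of div(F phi) = div(F) phi + F . grad(phi)
# for a vector field F: R^2 -> R^2 and a scalar phi: R^2 -> R.
import numpy as np

F = lambda p: np.array([p[0] ** 2, p[0] * p[1]])
phi = lambda p: np.sin(p[0] + p[1])

def div_num(field, p, h=1e-5):
    """Central-difference divergence of a vector field at point p."""
    return sum(
        (field(p + h * e)[i] - field(p - h * e)[i]) / (2 * h)
        for i, e in enumerate(np.eye(2))
    )

def grad_num(f, p, h=1e-5):
    """Central-difference gradient of a scalar function at point p."""
    return np.array([(f(p + h * e) - f(p - h * e)) / (2 * h) for e in np.eye(2)])

p0 = np.array([0.7, -0.3])
lhs = div_num(lambda q: F(q) * phi(q), p0)
rhs = div_num(F, p0) * phi(p0) + F(p0) @ grad_num(phi, p0)
print(abs(lhs - rhs))   # ~0 up to finite-difference error
```

The same check extends row by row to the q > 1 case, since the tensor rule acts on each row F_j separately.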
What is known: multivariate normal

Y \in R^d is multivariate normal MVN(0, \Sigma) if and only if

  E[Y^T \nabla f(Y)] = E[\nabla^T \Sigma \nabla f(Y)]

for all smooth f: R^d \to R.

Assume that h: R^d \to R has 3 bounded derivatives. Then, if \Sigma \in R^{d \times d} is symmetric and positive definite and Z \sim MVN(0, I_d), there is a solution f: R^d \to R to the Stein equation

  \nabla^T \Sigma \nabla f(w) - w^T \nabla f(w) = h(w) - E h(\Sigma^{1/2} Z),

which holds for every w \in R^d.
The Mehler formula

To solve \nabla^T \Sigma \nabla f(w) - w^T \nabla f(w) = h(w) - E h(\Sigma^{1/2} Z), for t \in [0, 1] put

  Z_{w,t} = \sqrt{t} \, w + \sqrt{1 - t} \, \Sigma^{1/2} Z;

then

  f(w) = -\int_0^1 \frac{1}{2t} \big[ E h(Z_{w,t}) - E h(\Sigma^{1/2} Z) \big] dt,  w \in R^d,

is a solution to the Stein equation. This solution f satisfies the bounds

  \sup_w \Big| \frac{\partial^k f(w)}{\partial w_{i_1} \cdots \partial w_{i_k}} \Big| \le \frac{1}{k} \sup_w \Big| \frac{\partial^k h(w)}{\partial w_{i_1} \cdots \partial w_{i_k}} \Big|

for every w \in R^d.

(Barbour 1990, Götze 1993, Rinott and Rotar 1996, Goldstein and Rinott 1996, R. and Röllin 2007, Meckes 2009, Chen, Goldstein and Shao 2011)
What is known: strictly log-concave (Mackey and Gorham 2016)

For a continuous density p on R^d such that log p \in C^4(R^d) is k-strictly concave, the operator

  A f(w) = \frac{1}{2} \langle \nabla f(w), \nabla \log p(w) \rangle + \frac{1}{2} \Delta f(w)

is the generator of an overdamped Langevin diffusion. The Stein equation A f(w) = h(w) - E_p h is solved by

  f(w) = \int_0^\infty \big[ E_p h - E h(Z_{w,t}) \big] dt,

with (Z_{w,t})_{t \ge 0} the overdamped Langevin diffusion with generator A and Z_{w,0} = w. The first three derivatives of f can be bounded in terms of same- and lower-order derivatives of h.
What is known: score functions (Nourdin et al. 2013, 2014)

Let X \in R^d have mean 0 and pdf p: R^d \to R. The score of p is the random vector \rho_p(X) \in R^d which satisfies

  E[\rho_p(X) \phi(X)] = -E[\nabla \phi(X)]  for all \phi \in C_c^\infty(R^d).

If p has a score, then it is uniquely defined through \rho_p(x) = \nabla \log p(x).
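The defining identity of the score is easy to test by simulation. A sketch for the standard normal on R^2 (my choice of example and test function; for this p, \rho_p(x) = \nabla \log p(x) = -x):

```python
# Monte Carlo check of E[rho_p(X) phi(X)] = -E[grad phi(X)] for X ~ N(0, I_2),
# where the score is rho_p(x) = -x.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=(2_000_000, 2))

phi = lambda v: np.sin(v[:, 0]) * np.cos(v[:, 1])
grad_phi = lambda v: np.stack(
    [np.cos(v[:, 0]) * np.cos(v[:, 1]), -np.sin(v[:, 0]) * np.sin(v[:, 1])],
    axis=1,
)

lhs = np.mean(-x * phi(x)[:, None], axis=0)   # E[rho_p(X) phi(X)], componentwise
rhs = -np.mean(grad_phi(x), axis=0)           # -E[grad phi(X)]
print(np.abs(lhs - rhs).max())                # small
```

The identity is just multivariate Gaussian integration by parts in this case, so the two vectors agree up to Monte Carlo noise.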
What is known: Stein kernels (Nourdin et al. 2013, 2014)

A random d \times d matrix \tau_p(X) such that

  E[\tau_p(X) \nabla \phi(X)] = E[X \phi(X)]  for all \phi \in C_c^\infty(R^d)

is called a strong Stein kernel for p.

Ledoux et al. 2015: \tau_p(X) is a weak Stein kernel if for all \phi \in C_c^\infty(R^d)

  E[ Tr( \tau_p(X) \{Hess \, \phi(X)\}^T ) ] = E[X \cdot \nabla \phi(X)].

There is no reason to assume uniqueness for the Stein kernel, or existence. If \tau_1 and \tau_2 are two Stein kernels for p, then for all \phi \in C_c^\infty(R^d),

  E[(\tau_1(X) - \tau_2(X)) \nabla \phi(X)] = 0;

then div(p(x)(\tau_1(x) - \tau_2(x))) = 0, from which we get uniqueness only in the one-dimensional case.
The general multivariate density case

Let X \in R^d have pdf p: R^d \to R with respect to the Lebesgue measure on R^d. Let \Omega be the support of p.

1. Let q \in N_0. The q-Stein class for X is the class F_q(X) of all F: R^d \to R^{q \times d} such that pF is (i) differentiable on \Omega, in the sense that its gradient exists, (ii) div(pF) is integrable, and (iii) \int_\Omega div(pF) = 0.

2. We propose as Stein operator of p the operator

  T_p F = \frac{div(F p)}{p}

acting on test functions F \in F_q(X). If F \in F_q(X), then T_p F: R^d \to R^q.
Stein-type integration by parts

To each F \in F_q(p) we associate dom(\nabla, p, F), the vector space of functions g: R^d \to R such that Fg \in F_q(p) and F \nabla g \in L^1(p).

Proposition: E_p[F \nabla g] = -E_p[(T_p F) g] for all F \in F_q(p) and all g \in dom(\nabla, p, F).

Proof: Apply the product rule for divergence, div(F \phi) = div(F) \phi + F \nabla \phi, to (F g) p with \phi = g, to show that for T_p F = div(F p)/p,

  T_p(F g) = (T_p F) g + F \nabla g,

and then take expectations, using that \int_\Omega div(F g p) = 0 and hence the left-hand side has mean 0.
Stein operators

As in the one-dimensional case, our Stein operators depend on two test functions, F and g, and are obtained from

  T_p(F g) = (T_p F) g + F \nabla g

either by fixing F and considering g as the (scalar-valued) test function, or by fixing g and considering F as the (matrix-valued) test function.
F = I_d fixed

Suppose that the identity matrix I_d \in F_d(p) (e.g. if p is log-concave and vanishes at \partial\Omega). Then T_p I_d = \nabla \log p = \rho_p, and the Stein operator is A_p g: R^d \to R^d,

  A_p g = T_p(I_d g) = \nabla g + \rho_p \, g,

acting on g: R^d \to R belonging to dom(\nabla, p, I_d).
F = \tau_p fixed

Let X have mean \nu and suppose that there exists a d \times d matrix-valued function F = \tau_p (a Stein kernel) satisfying

  T_p(\tau_p)(x) = \nu - x

at all x. Then A_p g: R^d \to R^d,

  A_p g(x) = T_p(\tau_p g)(x) = (\nu - x) g(x) + \tau_p(x) \nabla g(x),

acting on differentiable functions g: R^d \to R belonging to dom(\nabla, p, \tau_p).
g = 1 fixed

For g: R^d \to R, g(x) = 1, we obtain for F \in F_q(p) the vector-valued operator

  A_p F(x) = T_p F(x) \in R^q.

The Stein equation for a zero-mean function h: R^d \to R^q is then

  A_p F(x) = \frac{div(F p)}{p}(x) = h(x),

which gives div(F p)(x) = p(x) h(x). There is not a unique solution. If q = d, then we could choose a solution F such that F_{i,j} = 0 for i \neq j.
Special case: q = 1

Let v = (v_1, \ldots, v_d): R^d \to R^d be a vector field in the 0-Stein class for p: R^d \to R. Then our Stein operator of p is

  T_p v = \frac{(\nabla \cdot v) p + v \cdot \nabla p}{p} = \sum_{i=1}^d \frac{\partial v_i}{\partial x_i} + \frac{1}{p} \sum_{i=1}^d v_i \frac{\partial p}{\partial x_i}.

This is a function from R^d to R. Take as vector field v = \nabla f for a smooth function f: R^d \to R. This choice gives

  A_p(f) = T_p \nabla f = \Delta f + \langle \nabla \log p, \nabla f \rangle,

interpreted as an operator on f rather than v. This is the operator considered by Mackey and Gorham (2016), except for a factor 1/2.
g = 1/p fixed

For g: R^d \to R, g(x) = 1/p(x), we obtain for F \in F_q(p) the vector-valued operator

  A_p F = \frac{div(F p)}{p^2} + F \nabla(1/p) \in R^q.

The Stein equation for a zero-mean function h: R^d \to R^q is then

  \frac{div(F p)}{p^2}(x) + F \nabla(1/p)(x) = h(x),

which gives div(F)(x) = p(x) h(x). Again there is not a unique solution. If q = d, then we could choose a solution F such that F_{i,j} = 0 for i \neq j.
Example: multivariate normal

Consider Z \sim MVN_d(0, \Sigma). Then \rho_p(x) = -\Sigma^{-1} x and \tau_p(x) = \Sigma (linear score and constant Stein kernel). These lead to the Stein operator, for g: R^d \to R,

  A_p g(x) = \Sigma \nabla g(x) - g(x) x.
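That E[A_p g(Z)] = E[\Sigma \nabla g(Z) - Z g(Z)] = 0 under MVN(0, \Sigma) is Stein's lemma, and a quick Monte Carlo sketch confirms it; the particular \Sigma and g below are arbitrary choices of mine:

```python
# Check that the multivariate-normal Stein operator A_p g = Sigma grad g - x g
# has expectation zero under N(0, Sigma).
import numpy as np

rng = np.random.default_rng(5)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
L = np.linalg.cholesky(Sigma)
z = rng.normal(size=(2_000_000, 2)) @ L.T      # rows ~ N(0, Sigma)

g = lambda v: np.exp(-0.5 * np.sum(v**2, axis=1))
grad_g = lambda v: -v * g(v)[:, None]          # gradient of g, row by row

val = np.mean(grad_g(z) @ Sigma.T - z * g(z)[:, None], axis=0)
print(np.abs(val).max())                       # close to 0
```

Replacing g by any smooth function with integrable gradient leaves the estimate near zero, which is exactly the characterising property.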
Example: elliptical distributions

A d-dimensional random vector has multivariate elliptical distribution E_d(\mu, \Sigma, \phi) if its density is given by

  p(x) = \kappa \, |\Sigma|^{-1/2} \phi\Big( \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \Big)  on R^d,

for \phi a smooth function and \kappa the normalising constant. Elliptical distributions (here with \mu = 0) have score function

  \rho_p(x) = \Sigma^{-1} x \, \frac{\phi'(x^T \Sigma^{-1} x / 2)}{\phi(x^T \Sigma^{-1} x / 2)},

and

  \tau_p(x) = \frac{1}{\phi(x^T \Sigma^{-1} x / 2)} \Big( \int_{x^T \Sigma^{-1} x / 2}^{\infty} \phi(u) du \Big) \Sigma

is a strong Stein kernel for p (Landsman, Vanduffel, Yao 2014).
Bounds on the solution of the Stein equation

So we have Stein equations, but when are the solutions well behaved? In the multivariate normal case: the Mehler formula. In the case of strictly log-concave distributions: the overdamped Langevin diffusion. The bounds will be distribution-specific.
Bounds using a Poincaré constant

We say that C_p is a Poincaré constant associated to \mu_X if for every smooth function \varphi \in L^2(\mu_X) such that E \varphi(X) = 0,

  E[\varphi^2(X)] \le C_p \, E[ |\nabla \varphi(X)|^2 ].

For example, when X has a k-log-concave density, the law of X satisfies a Poincaré inequality with C_p = 1/k.

Using the Lax-Milgram theorem we can show the following result. Let h be a smooth, 1-Lipschitz function. Let X be a random vector with density p, and assume C_p < \infty is a Poincaré constant for p(x)dx. Then there exists a weak solution u to

  \Delta u + \nabla \log p \cdot \nabla u = h - p(h),

such that \int |\nabla u|^2 p \le C_p^2.
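The Poincaré inequality itself is easy to illustrate numerically. A sketch for the standard normal on R, whose optimal Poincaré constant is the classical C_p = 1; the test function is an arbitrary choice of mine:

```python
# Monte Carlo illustration of E[phi^2] <= C_p E[|phi'|^2] for the standard
# normal, with (optimal) Poincare constant C_p = 1.
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=2_000_000)

phi_raw = np.sin(x)
phi = phi_raw - phi_raw.mean()     # centre so that E[phi(X)] ~ 0
phi_prime = np.cos(x)

lhs = np.mean(phi**2)              # ~ (1 - e^{-2})/2
rhs = np.mean(phi_prime**2)        # ~ (1 + e^{-2})/2
print(lhs, rhs)
```

Here both sides are available in closed form, and the inequality holds with a visible gap; equality is attained only by linear test functions.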
Application: nested densities

The Wasserstein distance between (the distributions of) X and Y is

  d_W(X, Y) = \sup_{h \in Lip(1)} |E h(X) - E h(Y)|.

Compare the Wasserstein distance between P_1 and P_2 on R^d, with densities p_1, assumed k-log-concave, and p_2 = \pi_0 p_1. Put

  A_1 u = \frac{1}{2} \nabla \log p_1 \cdot \nabla u + \frac{1}{2} \Delta u,  and  A_2 u = \frac{1}{2} \nabla \log p_2 \cdot \nabla u + \frac{1}{2} \Delta u.

Then

  A_2 u = A_1 u + \frac{1}{2} \nabla \log \pi_0 \cdot \nabla u.
Let h: R^d \to R be a 1-Lipschitz function, and u_h a solution to A_1 u_h = h - \int h p_1. Let X_1 (resp. X_2) have distribution P_1 (resp. P_2). Then, as A_2 u = A_1 u + \frac{1}{2} \nabla \log \pi_0 \cdot \nabla u,

  E[h(X_2)] - E[h(X_1)] = E[A_1 u_h(X_2)]
    = E\Big[ A_2 u_h(X_2) - \frac{1}{2} \nabla \log \pi_0(X_2) \cdot \nabla u_h(X_2) \Big]
    = -\frac{1}{2} E[ \nabla \log \pi_0(X_2) \cdot \nabla u_h(X_2) ].

Using the Poincaré bounds we obtain

  d_W(X_1, X_2) \le \frac{1}{k} E[ |\nabla \pi_0(X_1)| ].
Example: copulas

Let (V_1, V_2) be a 2-dimensional random vector such that the marginals V_1 and V_2 have uniform U[0,1] distribution. The copula of (V_1, V_2) is

  C(x_1, x_2) = P[V_1 \le x_1, V_2 \le x_2],  (x_1, x_2) \in [0,1]^2,

and we assume that the copula density c = \frac{\partial^2 C}{\partial x_1 \partial x_2} exists. Let (U_1, U_2) be independent U[0,1]; the copula of (U_1, U_2) is (x_1, x_2) \mapsto x_1 x_2.

Payne (1960): an optimal Poincaré constant for U[0,1]^2 is C_p = 2/\pi^2. Now we can show:

  d_W[(V_1, V_2), (U_1, U_2)] \le \frac{2}{\pi^2} \Big( \int_{[0,1]^2} |\nabla c(x_1, x_2)|^2 \, dx_1 dx_2 \Big)^{1/2}.
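Reading the copula bound as d_W \le \frac{2}{\pi^2} (\int_{[0,1]^2} |\nabla c|^2)^{1/2}, it can be evaluated in closed form for concrete families. A hypothetical worked example of mine (not on the slides) using the Farlie-Gumbel-Morgenstern copula density c(x_1, x_2) = 1 + \theta(1 - 2x_1)(1 - 2x_2):

```python
# For the FGM copula, |grad c|^2 = 4 theta^2 [(1-2x_2)^2 + (1-2x_1)^2], and
# each squared factor integrates to 1/3 over [0,1], so
# int |grad c|^2 = 8 theta^2 / 3 exactly.
import numpy as np

theta = 0.4
integral = 8 * theta**2 / 3               # closed-form Dirichlet energy of c
bound = (2 / np.pi**2) * np.sqrt(integral)
print(round(bound, 4))
```

The bound scales linearly in |\theta|, matching the intuition that small \theta means near-independence.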
Example: the effect of the prior on the posterior

Consider a normal model with mean \theta \in R^d and positive definite covariance matrix \Sigma. The likelihood of \theta given a sample (x_1, \ldots, x_n) is

  (2\pi)^{-nd/2} \det(\Sigma)^{-n/2} \exp\Big( -\frac{1}{2} \sum_{i=1}^n (x_i - \theta)^T \Sigma^{-1} (x_i - \theta) \Big).

We want to compare the posterior distribution P_1 = N(\bar{x}, n^{-1}\Sigma) of \theta under a uniform prior with the posterior P_2 under a normal prior with parameters (\mu, \Sigma_2); \Sigma_2 is assumed positive definite.
The operator norm of a matrix A is \|A\| = \sup_{\|x\|=1} \|Ax\|. The normal density p_1 is n/\|\Sigma\|-log-concave. Moreover, P_2 = N(\tilde{\mu}, \Sigma_n) with

  \tilde{\mu} = \mu + n \Sigma_n \Sigma^{-1} (\bar{x} - \mu),  \Sigma_n = (\Sigma_2^{-1} + n \Sigma^{-1})^{-1}.

After some calculation we find

  d_W(P_1, P_2) \le \|\Sigma (\Sigma + n\Sigma_2)^{-1}\| \Big( \|\bar{x} - \mu\| + \frac{2\Gamma(d/2 + 1/2)}{\Gamma(d/2)} \frac{\| \Sigma + (\Sigma_2 + n\Sigma_2 \Sigma^{-1} \Sigma_2) \|^{1/2}}{\sqrt{n}} \Big).

The closer \bar{x} is to \mu, the smaller the bound. The influence of \Sigma_2 vanishes as n \to \infty.
Last remarks

Solving and bounding the Stein equation is crucial for applying the method. Our framework gives a large (indeed infinite) choice of Stein equations to choose from. The effect of the prior on the posterior will be studied in more detail. We are thinking about the multivariate discrete case, too. Note that Barbour et al. give an approximation by a discretised multivariate normal, using Markov process arguments.
More informationThe Stein and Chen-Stein Methods for Functionals of Non-Symmetric Bernoulli Processes
ALEA, Lat. Am. J. Probab. Math. Stat. 12 (1), 309 356 (2015) The Stein Chen-Stein Methods for Functionals of Non-Symmetric Bernoulli Processes Nicolas Privault Giovanni Luca Torrisi Division of Mathematical
More informationNotes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed
18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,
More informationTHE L 2 -HODGE THEORY AND REPRESENTATION ON R n
THE L 2 -HODGE THEORY AND REPRESENTATION ON R n BAISHENG YAN Abstract. We present an elementary L 2 -Hodge theory on whole R n based on the minimization principle of the calculus of variations and some
More informationApproximation of fluid-structure interaction problems with Lagrange multiplier
Approximation of fluid-structure interaction problems with Lagrange multiplier Daniele Boffi Dipartimento di Matematica F. Casorati, Università di Pavia http://www-dimat.unipv.it/boffi May 30, 2016 Outline
More informationZ-estimators (generalized method of moments)
Z-estimators (generalized method of moments) Consider the estimation of an unknown parameter θ in a set, based on data x = (x,...,x n ) R n. Each function h(x, ) on defines a Z-estimator θ n = θ n (x,...,x
More informationIntroduction to Machine Learning
Introduction to Machine Learning Introduction to Probabilistic Methods Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB
More informationStein s Method and Stochastic Geometry
1 / 39 Stein s Method and Stochastic Geometry Giovanni Peccati (Luxembourg University) Firenze 16 marzo 2018 2 / 39 INTRODUCTION Stein s method, as devised by Charles Stein at the end of the 60s, is a
More informationToday. Probability and Statistics. Linear Algebra. Calculus. Naïve Bayes Classification. Matrix Multiplication Matrix Inversion
Today Probability and Statistics Naïve Bayes Classification Linear Algebra Matrix Multiplication Matrix Inversion Calculus Vector Calculus Optimization Lagrange Multipliers 1 Classical Artificial Intelligence
More informationLecture 12: Detailed balance and Eigenfunction methods
Lecture 12: Detailed balance and Eigenfunction methods Readings Recommended: Pavliotis [2014] 4.5-4.7 (eigenfunction methods and reversibility), 4.2-4.4 (explicit examples of eigenfunction methods) Gardiner
More informationRandom Variables. P(x) = P[X(e)] = P(e). (1)
Random Variables Random variable (discrete or continuous) is used to derive the output statistical properties of a system whose input is a random variable or random in nature. Definition Consider an experiment
More informationCalculation of Bayes Premium for Conditional Elliptical Risks
1 Calculation of Bayes Premium for Conditional Elliptical Risks Alfred Kume 1 and Enkelejd Hashorva University of Kent & University of Lausanne February 1, 13 Abstract: In this paper we discuss the calculation
More informationLecture 3: Expected Value. These integrals are taken over all of Ω. If we wish to integrate over a measurable subset A Ω, we will write
Lecture 3: Expected Value 1.) Definitions. If X 0 is a random variable on (Ω, F, P), then we define its expected value to be EX = XdP. Notice that this quantity may be. For general X, we say that EX exists
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationLecture 25: Review. Statistics 104. April 23, Colin Rundel
Lecture 25: Review Statistics 104 Colin Rundel April 23, 2012 Joint CDF F (x, y) = P [X x, Y y] = P [(X, Y ) lies south-west of the point (x, y)] Y (x,y) X Statistics 104 (Colin Rundel) Lecture 25 April
More informationEE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm
EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm 1. Feedback does not increase the capacity. Consider a channel with feedback. We assume that all the recieved outputs are sent back immediately
More informationSeparation of Variables in Linear PDE: One-Dimensional Problems
Separation of Variables in Linear PDE: One-Dimensional Problems Now we apply the theory of Hilbert spaces to linear differential equations with partial derivatives (PDE). We start with a particular example,
More informationAn inverse source problem in optical molecular imaging
An inverse source problem in optical molecular imaging Plamen Stefanov 1 Gunther Uhlmann 2 1 2 University of Washington Formulation Direct Problem Singular Operators Inverse Problem Proof Conclusion Figure:
More informationA New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction
A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with H. Bauschke and J. Bolte Optimization
More informationProbability and Distributions
Probability and Distributions What is a statistical model? A statistical model is a set of assumptions by which the hypothetical population distribution of data is inferred. It is typically postulated
More informationLecture 7. 1 Notations. Tel Aviv University Spring 2011
Random Walks and Brownian Motion Tel Aviv University Spring 2011 Lecture date: Apr 11, 2011 Lecture 7 Instructor: Ron Peled Scribe: Yoav Ram The following lecture (and the next one) will be an introduction
More informationIntroduction to Nonlinear Control Lecture # 3 Time-Varying and Perturbed Systems
p. 1/5 Introduction to Nonlinear Control Lecture # 3 Time-Varying and Perturbed Systems p. 2/5 Time-varying Systems ẋ = f(t, x) f(t, x) is piecewise continuous in t and locally Lipschitz in x for all t
More information[2] (a) Develop and describe the piecewise linear Galerkin finite element approximation of,
269 C, Vese Practice problems [1] Write the differential equation u + u = f(x, y), (x, y) Ω u = 1 (x, y) Ω 1 n + u = x (x, y) Ω 2, Ω = {(x, y) x 2 + y 2 < 1}, Ω 1 = {(x, y) x 2 + y 2 = 1, x 0}, Ω 2 = {(x,
More informationROOT FINDING REVIEW MICHELLE FENG
ROOT FINDING REVIEW MICHELLE FENG 1.1. Bisection Method. 1. Root Finding Methods (1) Very naive approach based on the Intermediate Value Theorem (2) You need to be looking in an interval with only one
More informationQuasi-Monte Carlo Methods for Applications in Statistics
Quasi-Monte Carlo Methods for Applications in Statistics Weights for QMC in Statistics Vasile Sinescu (UNSW) Weights for QMC in Statistics MCQMC February 2012 1 / 24 Quasi-Monte Carlo Methods for Applications
More informationThe Stein and Chen-Stein methods for functionals of non-symmetric Bernoulli processes
The Stein and Chen-Stein methods for functionals of non-symmetric Bernoulli processes Nicolas Privault Giovanni Luca Torrisi Abstract Based on a new multiplication formula for discrete multiple stochastic
More informationChp 4. Expectation and Variance
Chp 4. Expectation and Variance 1 Expectation In this chapter, we will introduce two objectives to directly reflect the properties of a random variable or vector, which are the Expectation and Variance.
More informationStein Couplings for Concentration of Measure
Stein Couplings for Concentration of Measure Jay Bartroff, Subhankar Ghosh, Larry Goldstein and Ümit Işlak University of Southern California [arxiv:0906.3886] [arxiv:1304.5001] [arxiv:1402.6769] Borchard
More informationFerromagnets and the classical Heisenberg model. Kay Kirkpatrick, UIUC
Ferromagnets and the classical Heisenberg model Kay Kirkpatrick, UIUC Ferromagnets and the classical Heisenberg model: asymptotics for a mean-field phase transition Kay Kirkpatrick, Urbana-Champaign June
More informationMeasuring Sample Quality with Stein s Method
Measuring Sample Quality with Stein s Method Lester Mackey Joint work with Jackson Gorham, Andrew Duncan, Sebastian Vollmer Microsoft Research, Opendoor Labs, University of Sussex, University of Warwick
More informationChapter 6. Stein s method, Malliavin calculus, Dirichlet forms and the fourth moment theorem
November 20, 2014 18:45 BC: 9129 - Festschrift Masatoshi Fukushima chapter6 page 107 Chapter 6 Stein s method, Malliavin calculus, Dirichlet forms and the fourth moment theorem Louis H.Y. Chen and Guillaume
More informationChapter 2: Fundamentals of Statistics Lecture 15: Models and statistics
Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:
More informationPerhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.
Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage
More informationGaussian Phase Transitions and Conic Intrinsic Volumes: Steining the Steiner formula
Gaussian Phase Transitions and Conic Intrinsic Volumes: Steining the Steiner formula Larry Goldstein, University of Southern California Nourdin GIoVAnNi Peccati Luxembourg University University British
More informationNormal approximation of geometric Poisson functionals
Institut für Stochastik Karlsruher Institut für Technologie Normal approximation of geometric Poisson functionals (Karlsruhe) joint work with Daniel Hug, Giovanni Peccati, Matthias Schulte presented at
More informationApplied Math Qualifying Exam 11 October Instructions: Work 2 out of 3 problems in each of the 3 parts for a total of 6 problems.
Printed Name: Signature: Applied Math Qualifying Exam 11 October 2014 Instructions: Work 2 out of 3 problems in each of the 3 parts for a total of 6 problems. 2 Part 1 (1) Let Ω be an open subset of R
More informationx log x, which is strictly convex, and use Jensen s Inequality:
2. Information measures: mutual information 2.1 Divergence: main inequality Theorem 2.1 (Information Inequality). D(P Q) 0 ; D(P Q) = 0 iff P = Q Proof. Let ϕ(x) x log x, which is strictly convex, and
More informationDistance-Divergence Inequalities
Distance-Divergence Inequalities Katalin Marton Alfréd Rényi Institute of Mathematics of the Hungarian Academy of Sciences Motivation To find a simple proof of the Blowing-up Lemma, proved by Ahlswede,
More informationDifferential Equations Preliminary Examination
Differential Equations Preliminary Examination Department of Mathematics University of Utah Salt Lake City, Utah 84112 August 2007 Instructions This examination consists of two parts, called Part A and
More informationMODERATE DEVIATIONS IN POISSON APPROXIMATION: A FIRST ATTEMPT
Statistica Sinica 23 (2013), 1523-1540 doi:http://dx.doi.org/10.5705/ss.2012.203s MODERATE DEVIATIONS IN POISSON APPROXIMATION: A FIRST ATTEMPT Louis H. Y. Chen 1, Xiao Fang 1,2 and Qi-Man Shao 3 1 National
More informationLinear and non-linear programming
Linear and non-linear programming Benjamin Recht March 11, 2005 The Gameplan Constrained Optimization Convexity Duality Applications/Taxonomy 1 Constrained Optimization minimize f(x) subject to g j (x)
More information