Tail inequalities for additive functionals and empirical processes of geometrically ergodic Markov chains
University of Warsaw
Banff, June 2009
Geometric ergodicity

Definition. A Markov chain $X = (X_n)_{n \ge 1}$ on a Polish space $\mathcal{X}$ with a transition function $P(\cdot,\cdot)\colon \mathcal{X} \times \mathcal{B}(\mathcal{X}) \to [0,1]$ and a unique stationary distribution $\pi$ is called geometrically ergodic if there exists $\rho < 1$ such that for every $x \in \mathcal{X}$ there exists a constant $M(x)$ such that
$$\|P^n(x,\cdot) - \pi\|_{TV} \le M(x)\,\rho^n.$$
If $M(x)$ can be taken independent of $x$, then $X$ is called uniformly ergodic.
Main question

What is the tail decay of
$$S := \sum_{i=1}^n f(X_i),$$
where $f\colon \mathcal{X} \to \mathbb{R}$, $\mathbb{E}_\pi f = 0$, $\|f\|_\infty \le a$, or of
$$S := \sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^n f(X_i)\Big|,$$
where $\mathcal{F}$ is a countable class of functions $f$ as above?
Regeneration method, split chain

Definition. A set $C \in \mathcal{B}(\mathcal{X})$ is called a small set if there exist a probability measure $\nu$ on $\mathcal{X}$ and $\varepsilon > 0$ such that for all $x \in C$ and $A \in \mathcal{B}(\mathcal{X})$,
$$P(x, A) \ge \varepsilon\,\nu(A),$$
and for all $x \in \mathcal{X}$,
$$\mathbb{P}_x\big(\exists\, n \ge 1 \colon X_n \in C\big) = 1.$$
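A concrete example may help (my illustration, not from the talk): for the Gaussian autoregressive kernel $P(x,\cdot) = \mathcal{N}(ax, 1)$ with $|a| < 1$, any interval $C = [-c, c]$ is small. Indeed, for $x \in C$ the transition density satisfies
$$p(x, y) = \varphi(y - ax) \ \ge\ \inf_{|x'| \le c} \varphi(y - ax') = \varphi(|y| + ac) =: g(y),$$
since the infimum is attained at the admissible mean $ax'$ farthest from $y$. Taking $\varepsilon = \int g(y)\,dy = 2(1 - \Phi(ac))$ and $\nu(dy) = \varepsilon^{-1} g(y)\,dy$ gives $P(x, A) \ge \varepsilon\nu(A)$ for all $x \in C$; e.g. $a = 1/2$, $c = 1$ yields $\varepsilon \approx 0.617$. The chain also reaches $C$ from every starting point with probability one. This kernel is reused in the simulation sketches below.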
Regeneration method, split chain

We can define a new chain $(\widetilde{X}_n, R_n)$ in the following way. Given $\widetilde{X}_n = x$:
- if $x \notin C$, draw $\widetilde{X}_{n+1}$ from $P(x, \cdot)$ and set $R_n = 0$;
- if $x \in C$, toss a coin with probability of heads equal to $\varepsilon$:
  - heads: draw $\widetilde{X}_{n+1}$ from $\nu$ and set $R_n = 1$;
  - tails: draw $\widetilde{X}_{n+1}$ from $\dfrac{P(x,\cdot) - \varepsilon\nu(\cdot)}{1 - \varepsilon}$ and set $R_n = 0$.

Since
$$\varepsilon\,\nu(\cdot) + (1 - \varepsilon)\,\frac{P(x,\cdot) - \varepsilon\nu(\cdot)}{1 - \varepsilon} = P(x, \cdot),$$
$\widetilde{X}_n$ is again a Markov chain with transition function $P$ (and we will identify it with $X_n$).
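A minimal simulation sketch of the split chain for the AR(1) kernel from the small-set example above (my illustration, not from the talk; it uses the equivalent retrospective implementation, in which $R_n$ is drawn after $X_{n+1}$ with conditional probability $\varepsilon\nu(y)/p(x,y)$, rather than the coin-first construction on the slide):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

A = 0.5    # AR(1) coefficient: P(x, .) = N(A*x, 1)
C = 1.0    # small set C = [-C, C]
EPS = 2 * norm.sf(A * C)   # minorization constant, ~0.617 here

def p_density(x, y):
    """Transition density p(x, y) of N(A*x, 1)."""
    return norm.pdf(y - A * x)

def nu_density(y):
    """nu(y) = g(y)/eps with g(y) = inf_{|x|<=C} p(x, y) = phi(|y| + A*C)."""
    return norm.pdf(abs(y) + A * C) / EPS

def split_chain(n, x0=0.0):
    """Simulate (X_i, R_i).  Regenerations are declared retrospectively:
    after stepping from x in C to y, set R = 1 with probability
    eps * nu(y) / p(x, y), which is <= 1 by the minorization."""
    X = np.empty(n)
    R = np.zeros(n, dtype=int)
    X[0] = x0
    for i in range(n - 1):
        x = X[i]
        y = A * x + rng.standard_normal()   # one step from P(x, .)
        X[i + 1] = y
        if abs(x) <= C and rng.random() < EPS * nu_density(y) / p_density(x, y):
            R[i] = 1                        # X[i+1] is then a draw from nu
    return X, R
```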
Regeneration method, split chain

Let $T_1 = \inf\{n > 0 \colon R_n = 1\}$, $T_{i+1} = \inf\{n > 0 \colon R_{T_1+\ldots+T_i+n} = 1\}$, and
$$Y_0 = (X_1, \ldots, X_{T_1}), \qquad Y_i = (X_{T_1+\ldots+T_i+1}, \ldots, X_{T_1+\ldots+T_{i+1}}).$$

Fact. The blocks $Y_i$, $i \ge 0$, are independent, and the blocks $Y_i$, $i \ge 1$, are i.i.d. If $f\colon \mathcal{X} \to \mathbb{R}$ and
$$Z_i = Z_i(f) = \sum_{j = T_1+\ldots+T_i+1}^{T_1+\ldots+T_{i+1}} f(X_j),$$
then for $i \ge 1$, $\mathbb{E}Z_i = (\mathbb{E}T_2)\,\mathbb{E}_\pi f$, i.e. $\mathbb{E}_\pi f = (\mathbb{E}T_2)^{-1}\,\mathbb{E}Z_i$.
Regeneration method, summary

We can write
$$f(X_1) + \ldots + f(X_n) = Z_0 + \ldots + Z_N + \sum_{i = T_1+\ldots+T_{N+1}+1}^{n} f(X_i),$$
where
$$N = \sup\{i \in \mathbb{N} \colon T_1 + \ldots + T_{i+1} \le n\}, \qquad Z_i = Z_i(f) = \sum_{j=T_1+\ldots+T_i+1}^{T_1+\ldots+T_{i+1}} f(X_j), \quad i \ge 1,$$
$$Z_0 = Z_0(f) = \sum_{i=1}^{T_1} f(X_i),$$
and use the i.i.d. theory to analyze additive functionals.
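Continuing the sketch, one can cut a simulated trajectory into blocks and check the decomposition numerically (again my illustration; `split_chain` is defined above):

```python
def blocks(X, R, f):
    """Cut a trajectory into regeneration blocks: returns Z_0, the
    i.i.d. sums (Z_1, ..., Z_N) over complete blocks, and the sum
    over the incomplete final block.  f must be vectorized."""
    n = len(X)
    # R[i] = 1 means X[i+1] ~ nu, so a new block starts at index i + 1
    starts = [0] + [i + 1 for i in range(n - 1) if R[i] == 1]
    sums = [f(X[s:e]).sum() for s, e in zip(starts, starts[1:])]
    Z0, Z = sums[0], np.array(sums[1:])
    rest = f(X[starts[-1]:]).sum()
    return Z0, Z, rest

X, R = split_chain(100_000)
f = np.sign                  # bounded by 1, E_pi f = 0 since pi is symmetric
Z0, Z, rest = blocks(X, R, f)
assert np.isclose(Z0 + Z.sum() + rest, f(X).sum())   # the decomposition above
```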
Regeneration method, summary

The idea goes back to Nummelin (early 1980s) and was developed subsequently by Meyn and Tweedie for proving limit theorems. A sample recent result is

Theorem (Bednorz, Latała, Łatuszyński). If $\mathbb{E}_\pi f^2 < \infty$ and $\mathbb{E}_\pi f = 0$, then
$$\frac{f(X_1) + \ldots + f(X_n)}{\sqrt{n}}$$
converges weakly iff $\mathbb{E} Z_1(f)^2 < \infty$. The limiting distribution is $\mathcal{N}(0, \sigma^2)$, where $\sigma^2 = \mathrm{Var}(Z_1)/(\mathbb{E}T_2)$.

For concentration inequalities, the regeneration method has been used e.g. by Clémençon (2001) and Douc, Guillin, Moulines (2008).
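A quick Monte Carlo sanity check of $\sigma^2 = \mathrm{Var}(Z_1)/\mathbb{E}T_2$ on the toy chain (illustration only, under the assumptions of the sketches above):

```python
# Estimate sigma^2 = Var(Z_1) / E T_2 from the blocks of one long run ...
X, R = split_chain(500_000)
Z0, Z, rest = blocks(X, R, np.sign)
T = np.diff(np.flatnonzero(R))          # inter-regeneration times T_2, T_3, ...
sigma2_regen = Z.var() / T.mean()

# ... and compare with the empirical variance of n^{-1/2} S_n.
n_steps, n_runs = 4_000, 400
S = [np.sign(split_chain(n_steps)[0]).sum() / np.sqrt(n_steps)
     for _ in range(n_runs)]
print(sigma2_regen, np.var(S))          # should be of the same order
```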
Drift conditions

Theorem (Meyn, Tweedie). A Markov chain $X_n$ is geometrically ergodic iff there exist $V\colon \mathcal{X} \to [1, \infty)$ and constants $\lambda < 1$ and $K < \infty$ such that
$$PV(x) = \int_{\mathcal{X}} V(y)\, P(x, dy) \le \begin{cases} \lambda V(x) & \text{for } x \notin C, \\ K & \text{for } x \in C. \end{cases}$$

Theorem (Meyn & Tweedie, Baxendale). If $X_1 \sim \mu$ and $\mathbb{E}_\mu V < \infty$, then $\|T_1\|_{\psi_1}, \|T_2\|_{\psi_1} < \infty$.

Corollary. Consider a set $\mathcal{F}$ of functions $f\colon \mathcal{X} \to \mathbb{R}$ such that $\|f\|_\infty \le a$ for all $f \in \mathcal{F}$. Then
$$\Big\| \sup_{f \in \mathcal{F}} |Z_i(f)| \Big\|_{\psi_1} \le C a \tau,$$
where $\tau = \max(\|T_1\|_{\psi_1}, \|T_2\|_{\psi_1})$.
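For the same toy kernel one can exhibit a drift function explicitly (my worked example, not from the slides): take $V(x) = 1 + |x|$. With $\xi \sim \mathcal{N}(0,1)$,
$$PV(x) = 1 + \mathbb{E}|ax + \xi| \le 1 + |a||x| + \sqrt{2/\pi},$$
so for any $\lambda \in (|a|, 1)$ we have $PV(x) \le \lambda V(x)$ whenever
$$|x| \ge R := \frac{1 - \lambda + \sqrt{2/\pi}}{\lambda - |a|},$$
while $PV \le K := 1 + |a|R + \sqrt{2/\pi}$ on $C = \{|x| \le R\}$, which is a small set by the computation above. For $a = 1/2$ and $\lambda = 3/4$ this gives $R \approx 4.2$.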
Main theorem, a single function

Theorem (R.A. 2008). Consider a function $f\colon \mathcal{X} \to \mathbb{R}$ such that $\|f\|_\infty \le a$ and $\mathbb{E}_\pi f = 0$. Define also the random variable
$$S = \sum_{i=1}^n f(X_i).$$
Then for all $t > 0$,
$$\mathbb{P}(|S| > t) \le K \exp\Big( -\frac{1}{K} \min\Big( \frac{t^2}{n(\mathbb{E}T_2)^{-1}\mathrm{Var}\,Z_1},\ \frac{t}{\tau^2 a \log n} \Big) \Big).$$

Remark: $(\mathbb{E}T_2)^{-1}\mathrm{Var}\,Z_1$ is the variance of the limiting normal variable.
Main theorem, empirical processes

Theorem (R.A. 2008). Consider a countable class $\mathcal{F}$ of measurable functions $f\colon \mathcal{X} \to \mathbb{R}$ such that $\|f\|_\infty \le a$ and $\mathbb{E}_\pi f = 0$. Define the random variable
$$S = \sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^n f(X_i)\Big|$$
and the asymptotic weak variance
$$\sigma^2 = \sup_{f \in \mathcal{F}} \mathrm{Var}\,Z_1(f)/\mathbb{E}T_2.$$
Then for all $t \ge 1$,
$$\mathbb{P}\big(S \ge K\,\mathbb{E}S + t\big) \le K \exp\Big( -\frac{1}{K} \min\Big( \frac{t^2}{n\sigma^2},\ \frac{t}{\tau^3(\mathbb{E}T_2)^{-1} a \log n} \Big) \Big).$$
Sketch of the proof

Recall that
$$f(X_1) + \ldots + f(X_n) = Z_0 + Z_1 + \ldots + Z_N + \sum_{i=T_1+\ldots+T_{N+1}+1}^{n} f(X_i).$$

$\|Z_0\|_{\psi_1} \le Ca\tau \implies \mathbb{P}(|Z_0| \ge t) \le 2e^{-ct/(a\tau)}$.

One can easily show that
$$\big\|\big(n - (T_1 + \ldots + T_{N+1})\big)_+\big\|_{\psi_1} \le C\tau\log\tau.$$
This allows us to handle the last term.

What remains is $Z_1 + \ldots + Z_N$, a sum of random length.
By the LLN, N n/(et 2 ) (quantitative bounds by Bernstein s ψ 1 inequality), so with high probability Z 1 +... + Z N max i Cn/(ET 2 ) Z 1 +... + Z i and we can use Levy-type inequality due to Montgomery-Smith.
By the LLN, N n/(et 2 ) (quantitative bounds by Bernstein s ψ 1 inequality), so with high probability Z 1 +... + Z N max i Cn/(ET 2 ) Z 1 +... + Z i and we can use Levy-type inequality due to Montgomery-Smith. We are left with Z 1 (f ) +... + Z Cn/(ET2 )(f ) where Z i are i.i.d. and we control VarZ i (f ) and sup f F Z i (f ) ψ1.
Inequality for independent variables

Consider now $X_1, \ldots, X_n$, independent r.v.'s, and $\mathcal{F}$, a countable class of measurable functions $f$ such that $\mathbb{E}f(X_i) = 0$ and, for some $\alpha \in (0, 1]$,
$$\Big\|\sup_{f\in\mathcal{F}} |f(X_i)|\Big\|_{\psi_\alpha} < \infty.$$
Let
$$S = \sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^n f(X_i)\Big|, \qquad \sigma^2 = \sup_{f \in \mathcal{F}} \sum_{i=1}^n \mathbb{E} f(X_i)^2.$$
Inequality for independent variables

Theorem (R.A. 2008). Under the above assumptions, for all $0 < \eta < 1$, $\delta > 0$ and $t \ge 0$,
$$\mathbb{P}\big(S \ge (1+\eta)\mathbb{E}S + t\big),\ \mathbb{P}\big(S \le (1-\eta)\mathbb{E}S - t\big) \le \exp\Big(-\frac{t^2}{2(1+\delta)\sigma^2}\Big) + 3\exp\Big(-\Big(\frac{t}{C\,\|\max_i \sup_{f\in\mathcal{F}} |f(X_i)|\|_{\psi_\alpha}}\Big)^{\alpha}\Big),$$
where $C = C(\alpha, \eta, \delta)$.
Back to Markov chains

Since $\|\max_{i \le k} \|Y_i\|\|_{\psi_1} \le C \max_{i \le k} \|Y_i\|_{\psi_1} \log k$, we can apply the result for independent variables to $Z_1(f) + \ldots + Z_{Cn/(\mathbb{E}T_2)}(f)$.

In the empirical processes setting one also has to bound
$$\mathbb{E} \sup_{f\in\mathcal{F}} |Z_1(f) + \ldots + Z_{Cn/(\mathbb{E}T_2)}(f)|$$
in terms of
$$\mathbb{E} \sup_{f\in\mathcal{F}} |f(X_1) + \ldots + f(X_n)|$$
(optional sampling + concentration for $N$).
How to handle the independent case?

- Truncate and re-center the variables.
- Use Talagrand's inequality for the bounded part.
- Use another inequality of Talagrand to handle the unbounded part:

Theorem (Talagrand). For independent, centered Banach space valued variables $Z_i$ and $\alpha \in (0, 1]$,
$$\big\|\,\|Z_1 + \ldots + Z_n\|\,\big\|_{\psi_\alpha} \le C_\alpha \Big( \big\|\,\|Z_1 + \ldots + Z_n\|\,\big\|_1 + \big\|\max_{i \le n} \|Z_i\|\big\|_{\psi_\alpha} \Big).$$

In our case the Banach space is $\ell^\infty(\mathcal{F})$. Truncation at the level of $\mathbb{E}\max_i \sup_f |f(X_i)|$ makes the unbounded part satisfy
$$\big\|\,\|Z_1 + \ldots + Z_n\|\,\big\|_1 \le C\,\mathbb{E}\max_{i \le n} \|Z_i\| \le C_\alpha \big\|\max_{i \le n} \|Z_i\|\big\|_{\psi_\alpha}$$
(Hoffmann-Jørgensen inequality).
Optimality

In the inequality for independent variables ($\alpha = 1$), the $\log n$ in the exponent is optimal:
$$\mathbb{P}(X_i = \pm r) = \tfrac{1}{2} e^{-r}, \qquad \mathbb{P}(X_i = 0) = 1 - e^{-r}, \qquad r \to \infty.$$
This example can be emulated with Markov chains, which gives optimality of $\log n$ in the Markov setting as well.
Final comments

- The same scheme can be applied under the assumption that $\|T_i\|_{\psi_\alpha} < \infty$ ($\alpha \le 1$).
- Unbounded functions: if $\|f\|_{\psi_\alpha(\pi)} < \infty$ then $\|Z_i\|_{\psi_{\alpha/2}} < \infty$, which together with some additional arguments gives inequalities for the chain started from $\nu$ (W. Bednorz, R.A., unpublished).
- Using regeneration one can also obtain a bounded difference type inequality for symmetric functions (recovering e.g. Hoeffding inequalities for U-statistics in the Markov setting).
Some open (???) questions

- Can one get estimates of the form
$$\mathbb{P}\big(S \ge (1+\eta)\mathbb{E}S + t\big) \le \exp\Big(-\frac{t^2}{(2+\delta)\,n\,(\mathbb{E}T_2)^{-1}\mathrm{Var}\,Z_1}\Big) + K(\eta, \delta)\ldots?$$
- What about drift conditions on $f$ guaranteeing that $\|Z_i(f)\|_{\psi_1} < \infty$? Important for applications to MCMC algorithms; partial results by W. Bednorz (unpublished).
- Is there a nice characterization of Orlicz functions for which a Hoffmann-Jørgensen type inequality holds? M. Talagrand gave a characterization for functions of the form $\psi(x) = \exp(x\,\xi(x))$ (for $x$ large enough, $\xi$ nondecreasing): $\xi(e^u) \le L\,\xi(u)$ for $u$ large enough.
Thank you