Tail inequalities for additive functionals and empirical processes of geometrically ergodic Markov chains
University of Warsaw
Banff, June 2009
Geometric ergodicity

Definition. A Markov chain $X = (X_n)_{n \ge 1}$ on a Polish space $\mathcal{X}$ with a transition function $P(\cdot,\cdot)\colon \mathcal{X} \times \mathcal{B}(\mathcal{X}) \to [0,1]$ and a unique stationary distribution $\pi$ is called geometrically ergodic if there exists $\rho < 1$ such that for every $x \in \mathcal{X}$ there exists a constant $M(x)$ such that
$$\|P^n(x,\cdot) - \pi\|_{TV} \le M(x)\,\rho^n.$$
If $M(x)$ can be taken independent of $x$, then $X$ is called uniformly ergodic.
Main question

What is the tail decay of
$$S := \sum_{i=1}^n f(X_i),$$
where $f\colon \mathcal{X} \to \mathbb{R}$, $\mathbb{E}_\pi f = 0$, $\|f\|_\infty \le a$, or of
$$S := \sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^n f(X_i)\Big|,$$
where $\mathcal{F}$ is a countable class of functions $f$ as above?
Regeneration method, split chain

Definition. A set $C \in \mathcal{B}(\mathcal{X})$ is called a small set if there exist a probability measure $\nu$ on $\mathcal{X}$ and $\varepsilon > 0$ such that for all $x \in C$ and $A \in \mathcal{B}(\mathcal{X})$,
$$P(x, A) \ge \varepsilon\,\nu(A),$$
and for all $x \in \mathcal{X}$,
$$\mathbb{P}_x\big(\exists\, n \ge 1 \colon X_n \in C\big) = 1.$$
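A concrete example may help (my illustration, not from the talk): for the Gaussian autoregressive kernel $P(x,\cdot) = \mathcal{N}(ax, 1)$ with $|a| < 1$, any interval $C = [-c, c]$ is small. Indeed, for $x \in C$ the transition density satisfies
$$p(x, y) = \varphi(y - ax) \ \ge\ \inf_{|x'| \le c} \varphi(y - ax') = \varphi(|y| + ac) =: g(y),$$
since the infimum is attained at the admissible mean $ax'$ farthest from $y$. Taking $\varepsilon = \int g(y)\,dy = 2(1 - \Phi(ac))$ and $\nu(dy) = \varepsilon^{-1} g(y)\,dy$ gives $P(x, A) \ge \varepsilon\nu(A)$ for all $x \in C$; e.g. $a = 1/2$, $c = 1$ yields $\varepsilon \approx 0.617$. The chain also reaches $C$ from every starting point with probability one. This kernel is reused in the simulation sketches below.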
Regeneration method, split chain

We can define a new chain $(\widetilde{X}_n, R_n)$ in the following way. Given $\widetilde{X}_n = x$:
- if $x \notin C$, draw $\widetilde{X}_{n+1}$ from $P(x, \cdot)$ and set $R_n = 0$;
- if $x \in C$, toss a coin with probability of heads equal to $\varepsilon$:
  - heads: draw $\widetilde{X}_{n+1}$ from $\nu$ and set $R_n = 1$;
  - tails: draw $\widetilde{X}_{n+1}$ from $\dfrac{P(x,\cdot) - \varepsilon\nu(\cdot)}{1 - \varepsilon}$ and set $R_n = 0$.

Since
$$\varepsilon\,\nu(\cdot) + (1 - \varepsilon)\,\frac{P(x,\cdot) - \varepsilon\nu(\cdot)}{1 - \varepsilon} = P(x, \cdot),$$
$\widetilde{X}_n$ is again a Markov chain with transition function $P$ (and we will identify it with $X_n$).
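A minimal simulation sketch of the split chain for the AR(1) kernel from the small-set example above (my illustration, not from the talk; it uses the equivalent retrospective implementation, in which $R_n$ is drawn after $X_{n+1}$ with conditional probability $\varepsilon\nu(y)/p(x,y)$, rather than the coin-first construction on the slide):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

A = 0.5    # AR(1) coefficient: P(x, .) = N(A*x, 1)
C = 1.0    # small set C = [-C, C]
EPS = 2 * norm.sf(A * C)   # minorization constant, ~0.617 here

def p_density(x, y):
    """Transition density p(x, y) of N(A*x, 1)."""
    return norm.pdf(y - A * x)

def nu_density(y):
    """nu(y) = g(y)/eps with g(y) = inf_{|x|<=C} p(x, y) = phi(|y| + A*C)."""
    return norm.pdf(abs(y) + A * C) / EPS

def split_chain(n, x0=0.0):
    """Simulate (X_i, R_i).  Regenerations are declared retrospectively:
    after stepping from x in C to y, set R = 1 with probability
    eps * nu(y) / p(x, y), which is <= 1 by the minorization."""
    X = np.empty(n)
    R = np.zeros(n, dtype=int)
    X[0] = x0
    for i in range(n - 1):
        x = X[i]
        y = A * x + rng.standard_normal()   # one step from P(x, .)
        X[i + 1] = y
        if abs(x) <= C and rng.random() < EPS * nu_density(y) / p_density(x, y):
            R[i] = 1                        # X[i+1] is then a draw from nu
    return X, R
```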
Regeneration method, split chain

Let $T_1 = \inf\{n > 0 \colon R_n = 1\}$, $T_{i+1} = \inf\{n > 0 \colon R_{T_1+\ldots+T_i+n} = 1\}$, and
$$Y_0 = (X_1, \ldots, X_{T_1}), \qquad Y_i = (X_{T_1+\ldots+T_i+1}, \ldots, X_{T_1+\ldots+T_{i+1}}).$$

Fact. The blocks $Y_i$, $i \ge 0$, are independent, and the blocks $Y_i$, $i \ge 1$, are i.i.d. If $f\colon \mathcal{X} \to \mathbb{R}$ and
$$Z_i = Z_i(f) = \sum_{j = T_1+\ldots+T_i+1}^{T_1+\ldots+T_{i+1}} f(X_j),$$
then for $i \ge 1$, $\mathbb{E}Z_i = (\mathbb{E}T_2)\,\mathbb{E}_\pi f$, i.e. $\mathbb{E}_\pi f = (\mathbb{E}T_2)^{-1}\,\mathbb{E}Z_i$.
Regeneration method, summary

We can write
$$f(X_1) + \ldots + f(X_n) = Z_0 + \ldots + Z_N + \sum_{i = T_1+\ldots+T_{N+1}+1}^{n} f(X_i),$$
where
$$N = \sup\{i \in \mathbb{N} \colon T_1 + \ldots + T_{i+1} \le n\}, \qquad Z_i = Z_i(f) = \sum_{j=T_1+\ldots+T_i+1}^{T_1+\ldots+T_{i+1}} f(X_j), \quad i \ge 1,$$
$$Z_0 = Z_0(f) = \sum_{i=1}^{T_1} f(X_i),$$
and use the i.i.d. theory to analyze additive functionals.
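Continuing the sketch, one can cut a simulated trajectory into blocks and check the decomposition numerically (again my illustration; `split_chain` is defined above):

```python
def blocks(X, R, f):
    """Cut a trajectory into regeneration blocks: returns Z_0, the
    i.i.d. sums (Z_1, ..., Z_N) over complete blocks, and the sum
    over the incomplete final block.  f must be vectorized."""
    n = len(X)
    # R[i] = 1 means X[i+1] ~ nu, so a new block starts at index i + 1
    starts = [0] + [i + 1 for i in range(n - 1) if R[i] == 1]
    sums = [f(X[s:e]).sum() for s, e in zip(starts, starts[1:])]
    Z0, Z = sums[0], np.array(sums[1:])
    rest = f(X[starts[-1]:]).sum()
    return Z0, Z, rest

X, R = split_chain(100_000)
f = np.sign                  # bounded by 1, E_pi f = 0 since pi is symmetric
Z0, Z, rest = blocks(X, R, f)
assert np.isclose(Z0 + Z.sum() + rest, f(X).sum())   # the decomposition above
```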
Regeneration method, summary

The idea goes back to Nummelin (early 1980s) and was developed subsequently by Meyn and Tweedie for proving limit theorems. A sample recent result is

Theorem (Bednorz, Latała, Łatuszyński). If $\mathbb{E}_\pi f^2 < \infty$ and $\mathbb{E}_\pi f = 0$, then
$$\frac{f(X_1) + \ldots + f(X_n)}{\sqrt{n}}$$
converges weakly iff $\mathbb{E} Z_1(f)^2 < \infty$. The limiting distribution is $\mathcal{N}(0, \sigma^2)$, where $\sigma^2 = \mathrm{Var}(Z_1)/(\mathbb{E}T_2)$.

For concentration inequalities, the regeneration method has been used e.g. by Clémençon (2001) and Douc, Guillin, Moulines (2008).
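A quick Monte Carlo sanity check of $\sigma^2 = \mathrm{Var}(Z_1)/\mathbb{E}T_2$ on the toy chain (illustration only, under the assumptions of the sketches above):

```python
# Estimate sigma^2 = Var(Z_1) / E T_2 from the blocks of one long run ...
X, R = split_chain(500_000)
Z0, Z, rest = blocks(X, R, np.sign)
T = np.diff(np.flatnonzero(R))          # inter-regeneration times T_2, T_3, ...
sigma2_regen = Z.var() / T.mean()

# ... and compare with the empirical variance of n^{-1/2} S_n.
n_steps, n_runs = 4_000, 400
S = [np.sign(split_chain(n_steps)[0]).sum() / np.sqrt(n_steps)
     for _ in range(n_runs)]
print(sigma2_regen, np.var(S))          # should be of the same order
```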
Drift conditions

Theorem (Meyn, Tweedie). A Markov chain $X_n$ is geometrically ergodic iff there exist $V\colon \mathcal{X} \to [1, \infty)$ and constants $\lambda < 1$ and $K < \infty$ such that
$$PV(x) = \int_{\mathcal{X}} V(y)\, P(x, dy) \le \begin{cases} \lambda V(x) & \text{for } x \notin C, \\ K & \text{for } x \in C. \end{cases}$$

Theorem (Meyn & Tweedie, Baxendale). If $X_1 \sim \mu$ and $\mathbb{E}_\mu V < \infty$, then $\|T_1\|_{\psi_1}, \|T_2\|_{\psi_1} < \infty$.

Corollary. Consider a set $\mathcal{F}$ of functions $f\colon \mathcal{X} \to \mathbb{R}$ such that $\|f\|_\infty \le a$ for all $f \in \mathcal{F}$. Then
$$\Big\| \sup_{f \in \mathcal{F}} |Z_i(f)| \Big\|_{\psi_1} \le C a \tau,$$
where $\tau = \max(\|T_1\|_{\psi_1}, \|T_2\|_{\psi_1})$.
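For the same toy kernel one can exhibit a drift function explicitly (my worked example, not from the slides): take $V(x) = 1 + |x|$. With $\xi \sim \mathcal{N}(0,1)$,
$$PV(x) = 1 + \mathbb{E}|ax + \xi| \le 1 + |a||x| + \sqrt{2/\pi},$$
so for any $\lambda \in (|a|, 1)$ we have $PV(x) \le \lambda V(x)$ whenever
$$|x| \ge R := \frac{1 - \lambda + \sqrt{2/\pi}}{\lambda - |a|},$$
while $PV \le K := 1 + |a|R + \sqrt{2/\pi}$ on $C = \{|x| \le R\}$, which is a small set by the computation above. For $a = 1/2$ and $\lambda = 3/4$ this gives $R \approx 4.2$.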
Main theorem, a single function

Theorem (R.A. 2008). Consider a function $f\colon \mathcal{X} \to \mathbb{R}$ such that $\|f\|_\infty \le a$ and $\mathbb{E}_\pi f = 0$. Define also the random variable
$$S = \sum_{i=1}^n f(X_i).$$
Then for all $t > 0$,
$$\mathbb{P}(|S| > t) \le K \exp\Big( -\frac{1}{K} \min\Big( \frac{t^2}{n(\mathbb{E}T_2)^{-1}\mathrm{Var}\,Z_1},\ \frac{t}{\tau^2 a \log n} \Big) \Big).$$

Remark: $(\mathbb{E}T_2)^{-1}\mathrm{Var}\,Z_1$ is the variance of the limiting normal variable.
Main theorem, empirical processes

Theorem (R.A. 2008). Consider a countable class $\mathcal{F}$ of measurable functions $f\colon \mathcal{X} \to \mathbb{R}$ such that $\|f\|_\infty \le a$ and $\mathbb{E}_\pi f = 0$. Define the random variable
$$S = \sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^n f(X_i)\Big|$$
and the asymptotic weak variance
$$\sigma^2 = \sup_{f \in \mathcal{F}} \mathrm{Var}\,Z_1(f)/\mathbb{E}T_2.$$
Then for all $t \ge 1$,
$$\mathbb{P}\big(S \ge K\,\mathbb{E}S + t\big) \le K \exp\Big( -\frac{1}{K} \min\Big( \frac{t^2}{n\sigma^2},\ \frac{t}{\tau^3(\mathbb{E}T_2)^{-1} a \log n} \Big) \Big).$$
Sketch of the proof

Recall that
$$f(X_1) + \ldots + f(X_n) = Z_0 + Z_1 + \ldots + Z_N + \sum_{i=T_1+\ldots+T_{N+1}+1}^{n} f(X_i).$$

$\|Z_0\|_{\psi_1} \le Ca\tau \implies \mathbb{P}(|Z_0| \ge t) \le 2e^{-ct/(a\tau)}$.

One can easily show that
$$\big\|\big(n - (T_1 + \ldots + T_{N+1})\big)_+\big\|_{\psi_1} \le C\tau\log\tau.$$
This allows us to handle the last term.

What remains is $Z_1 + \ldots + Z_N$, a sum of random length.
By the LLN, N n/(et 2 ) (quantitative bounds by Bernstein s ψ 1 inequality), so with high probability Z 1 +... + Z N max i Cn/(ET 2 ) Z 1 +... + Z i and we can use Levy-type inequality due to Montgomery-Smith.
By the LLN, N n/(et 2 ) (quantitative bounds by Bernstein s ψ 1 inequality), so with high probability Z 1 +... + Z N max i Cn/(ET 2 ) Z 1 +... + Z i and we can use Levy-type inequality due to Montgomery-Smith. We are left with Z 1 (f ) +... + Z Cn/(ET2 )(f ) where Z i are i.i.d. and we control VarZ i (f ) and sup f F Z i (f ) ψ1.
Inequality for independent variables

Consider now $X_1, \ldots, X_n$, independent r.v.'s, and $\mathcal{F}$, a countable class of measurable functions $f$ such that $\mathbb{E}f(X_i) = 0$ and, for some $\alpha \in (0, 1]$,
$$\Big\|\sup_{f\in\mathcal{F}} |f(X_i)|\Big\|_{\psi_\alpha} < \infty.$$
Let
$$S = \sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^n f(X_i)\Big|, \qquad \sigma^2 = \sup_{f \in \mathcal{F}} \sum_{i=1}^n \mathbb{E} f(X_i)^2.$$
Inequality for independent variables

Theorem (R.A. 2008). Under the above assumptions, for all $0 < \eta < 1$, $\delta > 0$ and $t \ge 0$,
$$\mathbb{P}\big(S \ge (1+\eta)\mathbb{E}S + t\big),\ \mathbb{P}\big(S \le (1-\eta)\mathbb{E}S - t\big) \le \exp\Big(-\frac{t^2}{2(1+\delta)\sigma^2}\Big) + 3\exp\Big(-\Big(\frac{t}{C\,\|\max_i \sup_{f\in\mathcal{F}} |f(X_i)|\|_{\psi_\alpha}}\Big)^{\alpha}\Big),$$
where $C = C(\alpha, \eta, \delta)$.
Back to Markov chains

Since $\|\max_{i \le k} \|Y_i\|\|_{\psi_1} \le C \max_{i \le k} \|Y_i\|_{\psi_1} \log k$, we can apply the result for independent variables to $Z_1(f) + \ldots + Z_{Cn/(\mathbb{E}T_2)}(f)$.

In the empirical processes setting one also has to bound
$$\mathbb{E} \sup_{f\in\mathcal{F}} |Z_1(f) + \ldots + Z_{Cn/(\mathbb{E}T_2)}(f)|$$
in terms of
$$\mathbb{E} \sup_{f\in\mathcal{F}} |f(X_1) + \ldots + f(X_n)|$$
(optional sampling + concentration for $N$).
How to handle the independent case?

- Truncate and re-center the variables.
- Use Talagrand's inequality for the bounded part.
- Use another inequality of Talagrand to handle the unbounded part:

Theorem (Talagrand). For independent, centered Banach space valued variables $Z_i$ and $\alpha \in (0, 1]$,
$$\big\|\,\|Z_1 + \ldots + Z_n\|\,\big\|_{\psi_\alpha} \le C_\alpha \Big( \big\|\,\|Z_1 + \ldots + Z_n\|\,\big\|_1 + \big\|\max_{i \le n} \|Z_i\|\big\|_{\psi_\alpha} \Big).$$

In our case the Banach space is $\ell^\infty(\mathcal{F})$. Truncation at the level of $\mathbb{E}\max_i \sup_f |f(X_i)|$ makes the unbounded part satisfy
$$\big\|\,\|Z_1 + \ldots + Z_n\|\,\big\|_1 \le C\,\mathbb{E}\max_{i \le n} \|Z_i\| \le C_\alpha \big\|\max_{i \le n} \|Z_i\|\big\|_{\psi_\alpha}$$
(Hoffmann-Jørgensen inequality).
Optimality

In the inequality for independent variables ($\alpha = 1$), the $\log n$ in the exponent is optimal:
$$\mathbb{P}(X_i = \pm r) = \tfrac{1}{2} e^{-r}, \qquad \mathbb{P}(X_i = 0) = 1 - e^{-r}, \qquad r \to \infty.$$
This example can be emulated with Markov chains, which gives optimality of $\log n$ in the Markov setting as well.
Final comments

- The same scheme can be applied under the assumption that $\|T_i\|_{\psi_\alpha} < \infty$ ($\alpha \le 1$).
- Unbounded functions: if $\|f\|_{\psi_\alpha(\pi)} < \infty$ then $\|Z_i\|_{\psi_{\alpha/2}} < \infty$, which together with some additional arguments gives inequalities for the chain started from $\nu$ (W. Bednorz, R.A., unpublished).
- Using regeneration one can also obtain a bounded difference type inequality for symmetric functions (recovering e.g. Hoeffding inequalities for U-statistics in the Markov setting).
Some open (???) questions

- Can one get estimates of the form
$$\mathbb{P}\big(S \ge (1+\eta)\mathbb{E}S + t\big) \le \exp\Big(-\frac{t^2}{(2+\delta)\,n\,(\mathbb{E}T_2)^{-1}\mathrm{Var}\,Z_1}\Big) + K(\eta, \delta)\ldots?$$
- What about drift conditions on $f$ guaranteeing that $\|Z_i(f)\|_{\psi_1} < \infty$? Important for applications to MCMC algorithms; partial results by W. Bednorz (unpublished).
- Is there a nice characterization of Orlicz functions for which a Hoffmann-Jørgensen type inequality holds? M. Talagrand gave a characterization for functions of the form $\psi(x) = \exp(x\,\xi(x))$ (for $x$ large enough, $\xi$ nondecreasing): $\xi(e^u) \le L\,\xi(u)$ for $u$ large enough.
Thank you