Tail inequalities for additive functionals and empirical processes of geometrically ergodic Markov chains

University of Warsaw

Banff, June 2009

Geometric ergodicity

Definition. A Markov chain $X = (X_n)_{n\ge 1}$ on a Polish space $\mathcal{X}$ with a transition function $P(\cdot,\cdot)\colon \mathcal{X}\times\mathcal{B}(\mathcal{X})\to[0,1]$ and a unique stationary distribution $\pi$ is called geometrically ergodic if there exists $\rho<1$ such that for every $x\in\mathcal{X}$ there exists a constant $M(x)$ with
$$\|P^n(x,\cdot)-\pi\|_{TV}\le M(x)\rho^n.$$
If $M(x)$ can be taken independent of $x$, then $X$ is called uniformly ergodic.
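A minimal numerical illustration of the definition, using a toy two-state chain of my own (not from the talk): the total variation distance to $\pi$ can be computed exactly and compared with $\rho^n$, where $\rho$ is the second-largest eigenvalue modulus of the transition matrix.

```python
import numpy as np

# Toy two-state chain: ||P^n(x,.) - pi||_TV should decay like rho^n,
# where rho = 0.7 is the second-largest eigenvalue of P.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])            # stationary distribution: pi P = pi

Pn = np.eye(2)
for n in range(1, 31):
    Pn = Pn @ P
    tv = 0.5 * np.abs(Pn - pi).sum(axis=1)   # TV distance from each start state
    if n % 10 == 0:
        print(n, tv, 0.7 ** n)
```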

Main question

What is the tail decay of
$$S := \sum_{i=1}^n f(X_i),$$
where $f\colon\mathcal{X}\to\mathbb{R}$, $\mathbb{E}_\pi f = 0$, $\|f\|_\infty\le a$, or of
$$S := \sup_{f\in\mathcal{F}}\Big|\sum_{i=1}^n f(X_i)\Big|,$$
where $\mathcal{F}$ is a countable class of functions $f$ as above?

Regeneration method, split chain

Definition. A set $C\in\mathcal{B}(\mathcal{X})$ is called a small set if there exist a probability measure $\nu$ on $\mathcal{X}$ and $\varepsilon>0$ such that
$$P(x,A)\ge\varepsilon\,\nu(A)$$
for all $x\in C$ and $A\in\mathcal{B}(\mathcal{X})$, and
$$P_x\Big(\bigcup_{n>1}\{X_n\in C\}\Big)=1$$
for all $x\in\mathcal{X}$.
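For a finite state space the minorization condition can be checked directly. The sketch below (my own toy kernel, not from the talk) extracts a valid pair $(\varepsilon,\nu)$ for a candidate small set $C$ from row-wise minima of the transition matrix.

```python
import numpy as np

# Toy minorization: for x in C, P(x, .) >= eps * nu(.) with
#   eps = sum_y min_{x in C} P(x, y),   nu(y) = min_{x in C} P(x, y) / eps.
P = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])
C = [0, 1]                                   # candidate small set

row_min = P[C].min(axis=0)                   # min over x in C of P(x, y)
eps = row_min.sum()
nu = row_min / eps                           # minorizing probability measure
print("eps =", eps, "nu =", nu)
assert np.all(P[C] >= eps * nu - 1e-12)      # minorization holds on C
```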

Regeneration method, split chain

We can define a new chain $(\tilde{X}_n, R_n)$ in the following way. Given $\tilde{X}_n = x$:

if $x\notin C$, draw $\tilde{X}_{n+1}$ from $P(x,\cdot)$ and set $R_n = 0$;

if $x\in C$, toss a coin with probability of heads equal to $\varepsilon$:

heads: draw $\tilde{X}_{n+1}$ from $\nu$ and set $R_n = 1$;

tails: draw $\tilde{X}_{n+1}$ from $\dfrac{P(x,\cdot)-\varepsilon\nu(\cdot)}{1-\varepsilon}$ and set $R_n = 0$.

Since
$$\varepsilon\nu(\cdot) + (1-\varepsilon)\,\frac{P(x,\cdot)-\varepsilon\nu(\cdot)}{1-\varepsilon} = P(x,\cdot),$$
$\tilde{X}_n$ is again a Markov chain with transition function $P$ (and we will identify it with $X_n$).
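A minimal sketch of the splitting for the toy finite-state kernel above (my own example, not from the talk): it checks that the $\varepsilon$-mixture of $\nu$ and the residual kernel reproduces $P(x,\cdot)$ on $C$, and implements one transition of $(\tilde{X}_n, R_n)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy kernel and minorization pair (eps, nu) as in the previous sketch.
P = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])
C = [0, 1]
row_min = P[C].min(axis=0)
eps, nu = row_min.sum(), row_min / row_min.sum()

# The "heads"/"tails" mixture reconstructs P(x, .) exactly for x in C.
for x in C:
    residual = (P[x] - eps * nu) / (1 - eps)
    assert np.all(residual >= 0)
    assert np.allclose(eps * nu + (1 - eps) * residual, P[x])

def split_step(x):
    """One transition of the split chain (X_n, R_n); R_n = 1 is a regeneration."""
    if x in C and rng.random() < eps:
        return rng.choice(3, p=nu), 1                    # heads: draw from nu
    p = P[x] if x not in C else (P[x] - eps * nu) / (1 - eps)
    return rng.choice(3, p=p), 0                         # tails, or x outside C

print(split_step(0))
```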

Regeneration method, split chain

Let
$$T_1 = \inf\{n>0: R_n = 1\},\qquad T_{i+1} = \inf\{n>0: R_{T_1+\dots+T_i+n} = 1\}$$
and
$$Y_0 = (X_1,\dots,X_{T_1}),\qquad Y_i = (X_{T_1+\dots+T_i+1},\dots,X_{T_1+\dots+T_{i+1}}).$$

Fact. The blocks $Y_i$, $i\ge 0$, are independent, and the blocks $Y_i$, $i\ge 1$, are i.i.d. If $f\colon\mathcal{X}\to\mathbb{R}$ and
$$Z_i = Z_i(f) = \sum_{j=T_1+\dots+T_i+1}^{T_1+\dots+T_{i+1}} f(X_j),$$
then for $i\ge 1$, $\mathbb{E}_\pi f = (\mathbb{E}T_2)^{-1}\,\mathbb{E}Z_i$.

Regeneration method, summary

We can write
$$f(X_1)+\dots+f(X_n) = Z_0+\dots+Z_N+\sum_{i=T_1+\dots+T_{N+1}+1}^{n} f(X_i),$$
where
$$N = \sup\{i\in\mathbb{N}: T_1+\dots+T_{i+1}\le n\},$$
$$Z_0 = Z_0(f) = \sum_{i=1}^{T_1} f(X_i),\qquad Z_i = Z_i(f) = \sum_{j=T_1+\dots+T_i+1}^{T_1+\dots+T_{i+1}} f(X_j),\quad i\ge 1,$$
and use the i.i.d. theory to analyze additive functionals.

Regeneration method, summary

The idea goes back to Nummelin (early 80s) and was developed subsequently by Meyn and Tweedie for proving limit theorems. A sample recent result:

Theorem (Bednorz, Latała, Łatuszyński). If $\mathbb{E}_\pi f^2<\infty$ and $\mathbb{E}_\pi f = 0$, then
$$\frac{f(X_1)+\dots+f(X_n)}{\sqrt{n}}$$
converges weakly iff $\mathbb{E}Z_1(f)^2<\infty$. The limiting distribution is $\mathcal{N}(0,\sigma^2)$, where $\sigma^2 = \operatorname{Var}(Z_1)/\mathbb{E}T_2$.

For concentration inequalities, the regeneration method has been used e.g. by Clémençon (2001) and Douc, Guillin, Moulines (2008).
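The block decomposition and the variance formula $\sigma^2 = \operatorname{Var}(Z_1)/\mathbb{E}T_2$ can be checked by simulation. The sketch below (again my own toy finite-state kernel, not from the talk) cuts a split-chain trajectory at regeneration times, forms the block sums $Z_i(f)$ for a $\pi$-centered $f$, and compares $\operatorname{Var}(Z_1)/\mathbb{E}T_2$ with a batch-means estimate of the asymptotic variance; the two should roughly agree.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy split chain as before; cut the trajectory into i.i.d. regeneration blocks.
P = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])
C = {0, 1}
row_min = P[list(C)].min(axis=0)
eps, nu = row_min.sum(), row_min / row_min.sum()

pi = np.linalg.matrix_power(P, 200)[0]       # stationary distribution (numerically)
f = np.array([1.0, -0.5, 2.0])
f = f - pi @ f                               # center so that E_pi f = 0

def split_step(x):
    if x in C and rng.random() < eps:
        return rng.choice(3, p=nu), 1        # regeneration: next state ~ nu
    p = P[x] if x not in C else (P[x] - eps * nu) / (1 - eps)
    return rng.choice(3, p=p), 0

n = 100_000
xs, rs = np.empty(n, dtype=int), np.empty(n, dtype=int)
x = 0
for t in range(n):
    x, r = split_step(x)
    xs[t], rs[t] = x, r

starts = np.flatnonzero(rs)                  # times at which a new block starts
Z = [f[xs[a:b]].sum() for a, b in zip(starts[:-1], starts[1:])]  # block sums Z_i(f)
T = np.diff(starts)                          # block lengths (copies of T_2)
print("Var(Z_1)/E T_2      :", np.var(Z) / np.mean(T))

m = 1000                                     # batch-means check of sigma^2
batches = f[xs[: n // m * m]].reshape(-1, m).sum(axis=1)
print("batch-means estimate:", batches.var() / m)
```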

Drift conditions

Theorem (Meyn, Tweedie). A Markov chain $X_n$ is geometrically ergodic iff there exist $V\colon\mathcal{X}\to[1,\infty)$ and constants $\lambda<1$ and $K<\infty$ such that
$$PV(x) = \int_{\mathcal{X}} V(y)\,P(x,dy) \le \begin{cases}\lambda V(x) & \text{for } x\notin C,\\ K & \text{for } x\in C.\end{cases}$$

Theorem (Meyn & Tweedie, Baxendale). If $X_1\sim\mu$ and $\mathbb{E}_\mu V<\infty$, then $\|T_1\|_{\psi_1},\|T_2\|_{\psi_1}<\infty$.

Corollary. Consider a set $\mathcal{F}$ of functions $f\colon\mathcal{X}\to\mathbb{R}$ such that $\|f\|_\infty\le a$ for all $f\in\mathcal{F}$. Then
$$\Big\|\sup_{f\in\mathcal{F}}|Z_i(f)|\Big\|_{\psi_1}\le Ca\tau,$$
where $\tau = \max(\|T_1\|_{\psi_1},\|T_2\|_{\psi_1})$.
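Drift conditions of this form are how geometric ergodicity is typically verified in practice. The sketch below (my own toy example, not from the talk: a reflected random walk on the nonnegative integers with up-probability $p<1/2$, $V(x)=z^x$, $C=\{0\}$) checks $PV\le\lambda V$ off $C$ and $PV\le K$ on $C$ numerically over a truncated range.

```python
# Reflected random walk on {0,1,2,...}: up with prob. p, down with prob. 1-p
# (reflected at 0). Geometric drift function V(x) = z^x, small set C = {0}.
p, z = 0.3, 1.5

def V(x):
    return z ** x

def PV(x):
    # one-step expectation of V from state x
    return p * V(x + 1) + (1 - p) * V(max(x - 1, 0))

lam = p * z + (1 - p) / z          # PV(x) = lam * V(x) for x >= 1
K = PV(0)                          # bound on the small set C = {0}
print("lambda =", lam, "K =", K)
for x in range(1, 50):
    assert PV(x) <= lam * V(x) + 1e-9
```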

Main theorem: a single function

Theorem (R.A. 2008). Consider a function $f\colon\mathcal{X}\to\mathbb{R}$ such that $\|f\|_\infty\le a$ and $\mathbb{E}_\pi f = 0$. Define the random variable
$$S = \sum_{i=1}^n f(X_i).$$
Then for all $t>0$,
$$P(|S|>t)\le K\exp\left(-\frac{1}{K}\min\left(\frac{t^2}{n(\mathbb{E}T_2)^{-1}\operatorname{Var}Z_1},\ \frac{t}{\tau^2 a\log n}\right)\right).$$

Remark: $(\mathbb{E}T_2)^{-1}\operatorname{Var}Z_1$ is the variance of the limiting normal variable.

Main theorem: empirical processes

Theorem (R.A. 2008). Consider a countable class $\mathcal{F}$ of measurable functions $f\colon\mathcal{X}\to\mathbb{R}$ such that $\|f\|_\infty\le a$ and $\mathbb{E}_\pi f = 0$. Define the random variable
$$S = \sup_{f\in\mathcal{F}}\Big|\sum_{i=1}^n f(X_i)\Big|$$
and the asymptotic weak variance
$$\sigma^2 = \sup_{f\in\mathcal{F}}\operatorname{Var}Z_1(f)/\mathbb{E}T_2.$$
Then for all $t\ge 1$,
$$P(S\ge K\,\mathbb{E}S+t)\le K\exp\left(-\frac{1}{K}\min\left(\frac{t^2}{n\sigma^2},\ \frac{t}{\tau^3(\mathbb{E}T_2)^{-1}a\log n}\right)\right).$$

Sketch of the proof

Recall that
$$f(X_1)+\dots+f(X_n) = Z_0+Z_1+\dots+Z_N+\sum_{i=T_1+\dots+T_{N+1}+1}^{n} f(X_i).$$

$\|Z_0\|_{\psi_1}\le Ca\tau$, hence $P(|Z_0|\ge t)\le 2e^{-ct/(a\tau)}$.

One can easily show that
$$\big\|\big(n-(T_1+\dots+T_{N+1})\big)_+\big\|_{\psi_1}\le C\tau\log\tau,$$
which allows us to handle the last term.

What remains is $Z_1+\dots+Z_N$: a sum of random length.

By the LLN, N n/(et 2 ) (quantitative bounds by Bernstein s ψ 1 inequality), so with high probability Z 1 +... + Z N max i Cn/(ET 2 ) Z 1 +... + Z i and we can use Levy-type inequality due to Montgomery-Smith.

By the LLN, N n/(et 2 ) (quantitative bounds by Bernstein s ψ 1 inequality), so with high probability Z 1 +... + Z N max i Cn/(ET 2 ) Z 1 +... + Z i and we can use Levy-type inequality due to Montgomery-Smith. We are left with Z 1 (f ) +... + Z Cn/(ET2 )(f ) where Z i are i.i.d. and we control VarZ i (f ) and sup f F Z i (f ) ψ1.

Inequality for independent variables

Consider now $X_1,\dots,X_n$ independent r.v.'s and $\mathcal{F}$ a countable class of measurable functions $f$ such that $\mathbb{E}f(X_i)=0$ and, for some $\alpha\in(0,1]$, $\big\|\sup_{f\in\mathcal{F}}|f(X_i)|\big\|_{\psi_\alpha}<\infty$. Let
$$S = \sup_{f\in\mathcal{F}}\Big|\sum_{i=1}^n f(X_i)\Big|,\qquad \sigma^2 = \sup_{f\in\mathcal{F}}\sum_{i=1}^n \mathbb{E}f(X_i)^2.$$

Inequality for independent variables

Theorem (R.A. 2008). Under the above assumptions, for all $0<\eta<1$, $\delta>0$ and $t\ge 0$,
$$P\big(S\ge(1+\eta)\mathbb{E}S+t\big),\ P\big(S\le(1-\eta)\mathbb{E}S-t\big)\le\exp\left(-\frac{t^2}{2(1+\delta)\sigma^2}\right)+3\exp\left(-\left(\frac{t}{C\,\big\|\max_i\sup_{f\in\mathcal{F}}|f(X_i)|\big\|_{\psi_\alpha}}\right)^{\alpha}\right),$$
where $C = C(\alpha,\eta,\delta)$.

Back to Markov chains

Since
$$\Big\|\max_{i\le k}|Y_i|\Big\|_{\psi_1}\le C\max_{i\le k}\|Y_i\|_{\psi_1}\log k,$$
we can apply the result for independent variables to $Z_1(f)+\dots+Z_{Cn/(\mathbb{E}T_2)}(f)$.

In the empirical process setting one also has to bound
$$\mathbb{E}\sup_{f\in\mathcal{F}}\big|Z_1(f)+\dots+Z_{Cn/(\mathbb{E}T_2)}(f)\big|$$
in terms of
$$\mathbb{E}\sup_{f\in\mathcal{F}}\big|f(X_1)+\dots+f(X_n)\big|$$
(optional sampling + concentration for $N$).

How to handle the independent case?

Truncate and re-center the variables.

Use Talagrand's inequality for the bounded part.

Use another inequality of Talagrand to handle the unbounded part:

Theorem (Talagrand). For independent, centered Banach space valued variables $Z_i$ and $\alpha\in(0,1]$,
$$\big\|\,\|Z_1+\dots+Z_n\|\,\big\|_{\psi_\alpha}\le C_\alpha\Big(\big\|\,\|Z_1+\dots+Z_n\|\,\big\|_1+\big\|\max_{i\le n}\|Z_i\|\big\|_{\psi_\alpha}\Big).$$
In our case the Banach space is $\ell^\infty(\mathcal{F})$.

Truncation at the level of $\mathbb{E}\max_i\sup_f|f(X_i)|$ makes the unbounded part satisfy
$$\big\|\,\|Z_1+\dots+Z_n\|\,\big\|_1\le C\,\mathbb{E}\max_{i\le n}\|Z_i\|\le C_\alpha\big\|\max_{i\le n}\|Z_i\|\big\|_{\psi_\alpha}$$
(Hoffmann-Jørgensen inequality).

Optimality

In the inequality for independent variables ($\alpha=1$), the $\log n$ in the exponent is optimal:
$$P(X_i=\pm r)=\tfrac{1}{2}e^{-r},\qquad P(X_i=0)=1-e^{-r},\qquad r\to\infty.$$
This example can be emulated with Markov chains, which gives optimality of $\log n$.

Final comments

The same scheme can be applied under the assumption that $\|T_i\|_{\psi_\alpha}<\infty$ ($\alpha\le 1$).

Unbounded functions: if $\|f\|_{\psi_\alpha(\pi)}<\infty$ then $\|Z_i\|_{\psi_{\alpha/2}}<\infty$, which together with some additional arguments gives inequalities for the chain started from $\nu$ (W. Bednorz, R.A., unpublished).

Using regeneration one can also obtain a bounded-difference type inequality for symmetric functions (recovering e.g. Hoeffding inequalities for U-statistics in the Markov setting).

Some open (???) questions

Can one get estimates of the form
$$P\big(S\ge(1+\eta)\mathbb{E}S+t\big)\le\exp\left(-\frac{t^2}{(2+\delta)(\mathbb{E}T_2)^{-1}\operatorname{Var}Z_1}\right)+K(\eta,\delta)\cdots\ ?$$

What about drift conditions on $f$ guaranteeing that $\|Z_i(f)\|_{\psi_1}<\infty$? Important for applications to MCMC algorithms; partial results by W. Bednorz (unpublished).

Is there a nice characterization of Orlicz functions for which a Hoffmann-Jørgensen type inequality holds? M. Talagrand: characterization for functions of the form $\psi(x)=\exp(x\xi(x))$ ($x$ large enough, $\xi$ nondecreasing): $\xi(e^u)\le L\xi(u)$ for $u$ large enough.

Thank you