Confidence intervals for the mixing time of a reversible Markov chain from a single sample path

Slide 1: Confidence intervals for the mixing time of a reversible Markov chain from a single sample path

Daniel Hsu (Columbia University), Aryeh Kontorovich (Ben-Gurion University), Csaba Szepesvári (University of Alberta)

ITA

Slide 2: Problem

Irreducible, aperiodic, time-homogeneous Markov chain X_1, X_2, X_3, ...

There is a unique stationary distribution π with

    \lim_{t \to \infty} \mathcal{L}(X_t \mid X_1 = x) = \pi \quad \text{for all } x \in \mathcal{X}.

The mixing time t_mix is the earliest time t with

    \sup_{x \in \mathcal{X}} \big\| \mathcal{L}(X_t \mid X_1 = x) - \pi \big\|_{\mathrm{tv}} \le 1/4.

Problem: Given δ ∈ (0, 1) and X_{1:t}, determine a non-trivial interval I_t ⊆ [0, ∞] with

    \mathbb{P}(t_{\mathrm{mix}} \in I_t) \ge 1 - \delta.
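
To make the definition concrete, here is a minimal numpy sketch (an editorial illustration, not part of the talk) that brute-forces t_mix for a small chain whose transition matrix is known; `mixing_time` and the toy birth-death chain are arbitrary examples.

```python
import numpy as np

def mixing_time(P, threshold=0.25, t_max=10_000):
    """Brute-force t_mix: smallest t with max_x ||P^t(x, .) - pi||_tv <= threshold."""
    d = P.shape[0]
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi /= pi.sum()
    Pt = np.eye(d)
    for t in range(1, t_max + 1):
        Pt = Pt @ P
        # Worst-case total variation distance over starting states.
        tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()
        if tv <= threshold:
            return t
    raise RuntimeError("no mixing within t_max steps")

# A small reversible (birth-death) chain as an arbitrary example.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
print(mixing_time(P))  # -> 1 for this toy chain
```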

Slide 3: Some motivation from machine learning and statistics

Chernoff bounds for Markov chains X_1, X_2, ...: for suitably well-behaved f : X → R, with probability at least 1 − δ,

    \left| \frac{1}{t} \sum_{i=1}^{t} f(X_i) - \mathbb{E}_\pi f \right| \;\le\; \underbrace{\tilde{O}\!\left( \sqrt{\frac{t_{\mathrm{mix}} \log(1/\delta)}{t}} \right)}_{\text{deviation bound}}.

The bound depends on t_mix, which may be unknown a priori.

Examples:
- Bayesian inference: posterior means & variances via MCMC.
- Reinforcement learning: mean action rewards in an MDP.
- Supervised learning: error rates of hypotheses from non-iid data.

We need observable deviation bounds.
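
As a worked instance of the bound's shape (a sketch only: the constants and extra logarithmic factors hidden by Õ(·) are dropped, so this is an order-of-magnitude illustration rather than a rigorous certificate, and `deviation_bound` is a hypothetical helper name):

```python
import numpy as np

def deviation_bound(t_mix, delta, t):
    """Shape of the Markov-chain Chernoff deviation bound
    (constants and log factors hidden by the O-tilde are omitted)."""
    return np.sqrt(t_mix * np.log(1.0 / delta) / t)

# E.g., with t_mix = 100 and delta = 0.05, averaging over t = 10^6 steps
# gives a deviation on the order of:
print(deviation_bound(100, 0.05, 10**6))  # ~0.017
```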

Slide 4: Observable deviation bounds from mixing time bounds?

Suppose an estimator ˆt_mix = ˆt_mix(X_{1:t}) of t_mix satisfies

    \mathbb{P}(t_{\mathrm{mix}} \le \hat{t}_{\mathrm{mix}} + \varepsilon_t) \ge 1 - \delta.

Then with probability at least 1 − 2δ,

    \left| \frac{1}{t} \sum_{i=1}^{t} f(X_i) - \mathbb{E}_\pi f \right| \;\le\; \tilde{O}\!\left( \sqrt{\frac{(\hat{t}_{\mathrm{mix}} + \varepsilon_t) \log(1/\delta)}{t}} \right).

But ˆt_mix is computed from X_{1:t}, so ε_t may itself depend on t_mix.

Deviation bounds for point estimators are therefore insufficient: we need (observable) confidence intervals for t_mix.

Slide 5: What we do

1. Shift focus to the relaxation time t_relax to enable spectral methods.
2. Lower/upper bounds on the sample path length needed for point estimation of t_relax.
3. A new algorithm for constructing confidence intervals for t_relax.

Slide 6: Relaxation time

Let P be the transition operator of the Markov chain, and let λ be its second-largest eigenvalue modulus (i.e., the largest eigenvalue modulus other than 1).

Spectral gap: γ := 1 − λ. Relaxation time: t_relax := 1/γ.

    (t_{\mathrm{relax}} - 1) \ln 2 \;\le\; t_{\mathrm{mix}} \;\le\; t_{\mathrm{relax}} \ln\frac{4}{\pi_\star}

for π_⋆ := min_{x ∈ X} π(x).

The assumptions on P ensure γ, π_⋆ ∈ (0, 1).

Spectral approach: construct confidence intervals for γ and π_⋆.
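
A numpy sketch (editorial illustration; `relaxation_quantities` is a hypothetical helper) computing γ, t_relax, and both sides of the sandwich for a known reversible P:

```python
import numpy as np

def relaxation_quantities(P):
    """Spectral gap, relaxation time, and the t_mix sandwich for a reversible P."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi /= pi.sum()
    # For a reversible chain the eigenvalues of P are real.
    lams = np.sort(np.real(np.linalg.eigvals(P)))[::-1]
    lam = max(lams[1], abs(lams[-1]))   # second-largest eigenvalue modulus
    gamma = 1.0 - lam
    t_relax = 1.0 / gamma
    lower = (t_relax - 1.0) * np.log(2.0)
    upper = t_relax * np.log(4.0 / pi.min())
    return gamma, t_relax, lower, upper

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
print(relaxation_quantities(P))  # gamma = 0.5, t_relax = 2 for this toy chain
```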

Slide 7: Our results (point estimation)

We restrict to reversible Markov chains on finite state spaces. Let d be the (known a priori) cardinality of the state space X.

1. Lower bound: To estimate γ within a constant multiplicative factor, every algorithm needs (with probability 1/4) a sample path of length

    \Omega\!\left( \frac{d \log d}{\gamma} + \frac{1}{\pi_\star} \right).

2. Upper bound: A simple algorithm estimates γ and π_⋆ within a constant multiplicative factor (w.h.p.) from a sample path of length

    \tilde{O}\!\left( \frac{\log d}{\pi_\star \gamma^3} \right) \text{ (for } \gamma \text{)}, \qquad \tilde{O}\!\left( \frac{\log d}{\pi_\star \gamma} \right) \text{ (for } \pi_\star \text{)}.

But a point estimator is not a confidence interval.

Slide 8: Our results (confidence intervals)

3. New algorithm: Given δ ∈ (0, 1) and X_{1:t} as input, it constructs intervals I_t^γ and I_t^{π_⋆} such that

    \mathbb{P}(\gamma \in I_t^{\gamma}) \ge 1 - \delta \quad \text{and} \quad \mathbb{P}(\pi_\star \in I_t^{\pi_\star}) \ge 1 - \delta.

The widths of the intervals converge a.s. to zero at a \sqrt{(\log\log t)/t} rate.

4. Hybrid approach: Use the new algorithm to turn error bounds for point estimators into observable confidence intervals. (This improves the asymptotic rate for the π_⋆ interval.)

Slide 9: Plug-in estimator

Reversibility grants the symmetry of

    M := \operatorname{diag}(\pi)\, P = \big( \mathbb{P}_{X_1 \sim \pi}(X_1 = x,\, X_2 = x') \big)_{x, x' \in \mathcal{X}}

(the doublet state probabilities in the stationary chain).

Moreover, the eigenvalues of

    L := \operatorname{diag}(\pi)^{-1/2}\, M\, \operatorname{diag}(\pi)^{-1/2}

are real and satisfy

    1 = \lambda_1 > \lambda_2 \ge \cdots \ge \lambda_d > -1, \qquad \gamma = 1 - \max\{\lambda_2, |\lambda_d|\}.

Plug-in estimator: estimate π and M from X_{1:t} (using empirical frequencies), then plug in to the formula for γ.
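
A minimal sketch of this plug-in estimator (editorial illustration: states are assumed to be encoded 0..d−1 and each visited at least once, and the explicit symmetrization is a pragmatic guard, since the empirical L̂ need not be exactly symmetric):

```python
import numpy as np

def plugin_gap_estimate(path, d):
    """Plug-in estimate of the spectral gap from a single sample path.
    Assumes every state appears at least once in the path."""
    # Empirical doublet frequencies M-hat and state frequencies pi-hat.
    M = np.zeros((d, d))
    for x, x2 in zip(path[:-1], path[1:]):
        M[x, x2] += 1.0
    M /= M.sum()
    pi = M.sum(axis=1)
    # L-hat = diag(pi)^{-1/2} M diag(pi)^{-1/2}; symmetrize against noise.
    Dinv = 1.0 / np.sqrt(pi)
    L = Dinv[:, None] * M * Dinv[None, :]
    L = 0.5 * (L + L.T)
    lams = np.sort(np.linalg.eigvalsh(L))[::-1]
    return 1.0 - max(lams[1], abs(lams[-1]))
```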

Slide 10: Chicken-and-egg problem

A (matrix) Chernoff bound for Markov chains gives error bounds for the estimates of π and M (and ultimately of L and γ): e.g., w.h.p.,

    |\hat{\gamma} - \gamma| \;\le\; \|\hat{L} - L\| \;\le\; O\!\left( \sqrt{\frac{\log(d) \log(t/\pi_\star)}{\gamma\, \pi_\star\, t}} \right).

This has an inverse dependence on γ, so the bound cannot be solved for γ (unlike with empirical Bernstein inequalities).

Slide 11: Direct estimation of P

Alternative: directly estimate P from X_{1:t}.

Key advantage: observable confidence intervals for P via an empirical Bernstein inequality for martingales.

Two problems:

1. Without appealing to the symmetry structure, one can argue that

    \|\hat{P} - P\| \le \varepsilon \;\implies\; |\hat{\gamma} - \gamma| \le O(\varepsilon^{1/(2d)}),

but this implies an exponential slow-down in the rate.

2. A direct appeal to the symmetry structure of L = \operatorname{diag}(\pi)^{1/2}\, P\, \operatorname{diag}(\pi)^{-1/2} gives bounds that depend on π_⋆, which is unknown.

Our approach: directly estimate P, and indirectly estimate π via ˆP (see the sketch below).
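
A sketch of the direct estimate (editorial illustration; `estimate_P` and the smoothing constant alpha are assumptions, with the Laplace smoothing anticipating step 1 of the next slide by making every entry of ˆP positive, hence ˆP ergodic):

```python
import numpy as np

def estimate_P(path, d, alpha=1.0):
    """Row-normalized transition counts with Laplace smoothing.
    Smoothing (alpha > 0) makes every entry positive, so P-hat is ergodic."""
    counts = np.full((d, d), alpha)
    for x, x2 in zip(path[:-1], path[1:]):
        counts[x, x2] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)
```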

Slide 12: Indirect estimation of π

1. Ensure that ˆP is the transition operator of an ergodic chain (easy via Laplace smoothing).

2. Key step: estimate π from ˆP via the group inverse Â^# of Â := I − ˆP.

   "Â^# contains virtually everything that one would want to know about the chain [with transition operator ˆP]" (Meyer, 1975).

   - It reveals the unique stationary distribution ˆπ of ˆP; this is our indirect estimate of π.
   - It tells us how to bound ‖ˆπ − π‖ in terms of ‖ˆP − P‖; from this, we construct a confidence interval for π_⋆ (see the sketch below).
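
A sketch of the group-inverse computation (editorial illustration, using the classical identity Â^# = (Â + 1ˆπᵀ)^{-1} − 1ˆπᵀ for an ergodic chain; here ˆπ is obtained by a direct linear solve, whereas the paper's analysis reads the relevant quantities off the group-inverse machinery itself):

```python
import numpy as np

def stationary(P):
    """Solve pi P = pi, sum(pi) = 1 as an overdetermined linear system."""
    d = P.shape[0]
    A = np.vstack([P.T - np.eye(d), np.ones((1, d))])
    b = np.zeros(d + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def group_inverse(P, pi):
    """Group inverse of A = I - P for an ergodic chain (Meyer, 1975):
    A# = (A + 1 pi^T)^{-1} - 1 pi^T."""
    d = P.shape[0]
    Pi = np.outer(np.ones(d), pi)
    return np.linalg.inv(np.eye(d) - P + Pi) - Pi
```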

Slide 13: Overall algorithm (outline)

1. Form an empirical estimate of, and confidence intervals for, P (exploiting the Markov property and empirical-Bernstein-type bounds; illustrated below).
2. Form an estimate of, and confidence intervals for, π (via the group inverse of I − ˆP).
3. Form an estimate of, and a confidence interval for, γ (via the confidence intervals for π and P, and eigenvalue perturbation theory).
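
To illustrate step 1, a sketch of an empirical-Bernstein interval for a single transition probability P(x, x'), using the iid form of Maurer & Pontil (2009) as a stand-in for the martingale version the algorithm actually relies on; `eb_interval` is a hypothetical helper, and in practice one would union-bound over all d² entries:

```python
import numpy as np

def eb_interval(indicators, delta):
    """Two-sided empirical-Bernstein interval (Maurer & Pontil, 2009) for the
    mean of iid [0,1] observations, e.g. indicators of the transition x -> x'
    among the visits to x. Uses log(4/delta) to cover both sides."""
    n = len(indicators)
    mean = float(np.mean(indicators))
    var = float(np.var(indicators, ddof=1))  # unbiased sample variance
    log_term = np.log(4.0 / delta)
    slack = np.sqrt(2.0 * var * log_term / n) + 7.0 * log_term / (3.0 * (n - 1))
    return max(0.0, mean - slack), min(1.0, mean + slack)
```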

Slide 14: Recap and future work

We resolve the chicken-and-egg problem of obtaining observable confidence intervals for the mixing time from a single sample path, strongly exploiting the Markov property and ergodicity in the confidence intervals for P and π.

Problem #1: close the gap between the lower and upper bounds on sample path length (for point estimation).

Problem #2: overcome the computational bottlenecks from matrix operations.

Problem #3: handle large/continuous state spaces under suitable assumptions.

Thanks!
