arxiv:math/ v1 [math.pr] 21 Aug 2006


Measure Concentration of Markov Tree Processes

Leonid Kontorovich
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA, USA

May 23, 2008

Abstract

We prove an apparently novel concentration of measure result for Markov tree processes. The bound we derive reduces to the known bounds for Markov processes when the tree is a chain, thus strictly generalizing the known Markov process concentration results. We employ several techniques of potential independent interest, especially for obtaining similar results for more general directed acyclic graphical models.

1 Introduction

An emerging paradigm for proving concentration results for nonproduct measures is to quantify the dependence between the variables and state the bounds in terms of that dependence. A process (measure) particularly amenable to this approach is the Markov process. Using different techniques, Marton (coupling method [4], 1996), Samson (log-Sobolev inequality [6], 2000) and Kontorovich and Ramanan (martingale differences [3], 2006) have obtained qualitatively similar concentration of measure results for Markov processes. One natural generalization of the Markov process is the hidden Markov process; we proved a concentration result for this class in [2]. A different way to generalize the Markov process is via the Markov tree process, which we address in the present paper.

If $(S^n, d)$ is a metric space and $(X_i)_{1 \le i \le n}$, $X_i \in S$, is a random process, a measure concentration result (for the purposes of this paper) is an inequality stating that for any Lipschitz (with respect to $d$) function $f : S^n \to \mathbb{R}$, we have

$$P\{|f(X) - Ef(X)| > t\} \le \alpha(t), \tag{1}$$

where $\alpha(t)$ decays rapidly to 0 as $t$ gets large.

The quantity $\eta_{ij}$, defined below, has proved useful for obtaining concentration results. For $1 \le i < j \le n$, $y \in S^{i-1}$ and $w \in S$, let $\mathcal{L}(X_j^n \mid X_1^{i-1} = y, X_i = w)$ be the law of $X_j^n$ conditioned on $X_1^{i-1} = y$ and $X_i = w$. Define

$$\eta_{ij}(y, w, w') = \left\| \mathcal{L}(X_j^n \mid X_1^{i-1} = y, X_i = w) - \mathcal{L}(X_j^n \mid X_1^{i-1} = y, X_i = w') \right\|_{TV} \tag{2}$$

and

$$\eta_{ij} = \sup_{y \in S^{i-1}} \, \sup_{w, w' \in S} \eta_{ij}(y, w, w'),$$

where $\|\cdot\|_{TV}$ is the total variation norm (see §2.1 to clarify notation).

Let $\Gamma$ and $\Delta$ be upper-triangular $n \times n$ matrices, with $\Gamma_{ii} = \Delta_{ii} = 1$ and $\Gamma_{ij} = \sqrt{\eta_{ij}}$, $\Delta_{ij} = \eta_{ij}$ for $1 \le i < j \le n$.

For the case where $S = [0, 1]$ and $d$ is the Euclidean metric on $\mathbb{R}^n$, Samson [6] showed that if $f : [0, 1]^n \to \mathbb{R}$ is convex and Lipschitz with $\|f\|_{Lip} \le 1$, then

$$P\{|f(X) - Ef(X)| > t\} \le 2 \exp\left( -\frac{t^2}{2 \|\Gamma\|_2^2} \right), \tag{3}$$

where $\|\Gamma\|_2$ is the $\ell_2$ operator norm of the matrix $\Gamma$; Marton [5] has a comparable result.

For the case where $S$ is countable and $d$ is the (normalized) Hamming metric on $S^n$,

$$d(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{x_i \ne y_i\}},$$

Kontorovich and Ramanan [3] showed that if $f : S^n \to \mathbb{R}$ is Lipschitz with $\|f\|_{Lip} \le 1$, then

$$P\{|f(X) - Ef(X)| > t\} \le 2 \exp\left( -\frac{t^2}{2 \|\Delta\|_\infty^2} \right), \tag{4}$$

where $\|\Delta\|_\infty$ is the $\ell_\infty$ operator norm of the matrix $\Delta$, also given by

$$\|\Delta\|_\infty = \max_{1 \le i < n} \left( 1 + \eta_{i,i+1} + \cdots + \eta_{i,n} \right). \tag{5}$$

This leads to a strengthening of the Markov measure concentration result in Marton [4].

The sharpest currently known Markov measure concentration result (for the normalized Hamming metric) was obtained in [3], in terms of the contraction coefficients $(\theta_i)_{1 \le i < n}$ of the Markov process:

$$\eta_{ij} \le \theta_i \theta_{i+1} \cdots \theta_{j-1}. \tag{6}$$

In this paper, we prove a bound on $\eta_{ij}$ in terms of the contraction coefficients of the Markov tree process (Theorem 2.1). This bound is cumbersome to state without preliminary definitions, but it reduces to (6) in the case where the Markov tree is a chain.

2 Bounding $\eta_{ij}$ for Markov tree processes

2.1 Notational preliminaries

Random variables are capitalized ($X$), specified state sequences are written in lowercase ($x$), the shorthand $X_i^j \equiv X_i \ldots X_j$ is used for all sequences, and brackets denote sequence concatenation: $[x_i^j \, x_{j+1}^k] = x_i^k$. Another way to index collections of variables is by subset: if $I = \{i_1, i_2, \ldots, i_m\}$ then we write $x_I \equiv \{x_{i_1}, x_{i_2}, \ldots, x_{i_m}\}$. Thus, if $u \in \mathbb{R}^{S^I}$, then $u_{x_I} = u_{(x_{i_1}, x_{i_2}, \ldots, x_{i_m})} \in \mathbb{R}$ for each $x_I \in S^I$. We will use $|\cdot|$ to denote set cardinalities. Sums will range over the entire space of the summation variable; thus $\sum_{x_i^j} f(x_i^j)$ stands for $\sum_{x_i^j \in S^{j-i+1}} f(x_i^j)$, and $\sum_{x_I} f(x_I)$ is shorthand for $\sum_{x_I \in S^I} f(x_I)$.

The probability operator $P\{\cdot\}$ is defined with respect to the measure space specified in context. We will write $[n]$ for the set $\{1, \ldots, n\}$. Anytime $\|\cdot\|$ appears without a subscript, it will always denote the total variation norm $\|\cdot\|_{TV}$. Regarding the latter, we recall that if $\tau$ is a signed, balanced measure on a countable set $\mathcal{X}$ (i.e., $\tau(\mathcal{X}) = \sum_{x \in \mathcal{X}} \tau(x) = 0$), then

$$\|\tau\|_{TV} = \tfrac{1}{2} \|\tau\|_1 = \tfrac{1}{2} \sum_{x \in \mathcal{X}} |\tau(x)|. \tag{7}$$

If $G = (V, E)$ is a graph, we will frequently abuse notation and write $u \in G$ instead of $u \in V$, blurring the distinction between a graph and its vertex set. This notation will carry over to set-theoretic operations ($G = G_1 \cap G_2$) and indexing of variables (e.g., $X_G$).

Unless we need to refer explicitly to a σ-algebra, we will suppress it in the probability space notation, using less precise formulations, such as "Let $\mu$ be a measure on $S^n$." Furthermore, to avoid the technical but inessential complications associated with infinite sets, we will take $S$ to be finite in this paper, noting only that the bounds carry over unchanged to the countable case (as done in [3] and [2]). To extend the results to the continuous case, some mild measure-theoretic assumptions are needed (see [5]).

2.2 Definition of Markov tree process

2.2.1 Graph-theoretic preliminaries

Consider a directed acyclic graph $G = (V, E)$, and define a partial order $\prec_G$ on $G$ by the transitive closure of the relation

$$u \prec_G v \quad \text{if } (u, v) \in E.$$

We define the parents and children of $v \in V$ in the natural way:

$$\mathrm{parents}(v) = \{u \in V : (u, v) \in E\}$$

and

$$\mathrm{children}(v) = \{w \in V : (v, w) \in E\}.$$

If $G$ is connected and each $v \in V$ has at most one parent, $G$ is called a (directed) tree. In a tree, whenever $u \prec_G v$ there is a unique directed path from $u$ to $v$. A tree $T$ always has a unique minimal (w.r.t. $\prec_T$) element $r_0 \in V$, called its root. Thus, for every $v \in V$ there is a unique directed path $r_0 \prec_T r_1 \prec_T \cdots \prec_T r_d = v$; define the depth of $v$, $\mathrm{dep}_T(v) = d$, to be the length (i.e., number of edges) of this path. Note that $\mathrm{dep}_T(r_0) = 0$. We define the depth of the tree by $\mathrm{dep}(T) = \sup_{v \in T} \mathrm{dep}_T(v)$.

For $d = 0, 1, \ldots$ define the $d$th level of the tree $T$ by

$$\mathrm{lev}_T(d) = \{v \in V : \mathrm{dep}_T(v) = d\};$$

note that the levels induce a disjoint partition on $V$:

$$V = \bigcup_{d=0}^{\mathrm{dep}(T)} \mathrm{lev}_T(d).$$
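The depth and level definitions above can be made concrete with a short sketch. The representation below (a dictionary mapping each node to its parent, with the root mapped to `None`) and the function names `depths` and `levels` are illustrative assumptions, not notation from the paper.

```python
from collections import defaultdict

def depths(parent):
    """Return dep_T(v) for every node v.

    `parent` maps each node to its unique parent; the root maps to None.
    (This encoding of the tree is an assumption made for illustration.)
    """
    dep = {}

    def depth_of(v):
        # Depth is the number of edges on the unique path from the root to v.
        if v not in dep:
            p = parent[v]
            dep[v] = 0 if p is None else depth_of(p) + 1
        return dep[v]

    for v in parent:
        depth_of(v)
    return dep

def levels(parent):
    """Return the levels lev_T(d) as a dict d -> set of nodes at depth d."""
    lev = defaultdict(set)
    for v, d in depths(parent).items():
        lev[d].add(v)
    return dict(lev)

# Hypothetical example tree with edges a->b, a->c, b->d.
parent = {"a": None, "b": "a", "c": "a", "d": "b"}
dep = depths(parent)
lev = levels(parent)
tree_depth = max(dep.values())  # dep(T)
```

In this example the levels are disjoint and their union is the whole vertex set, which is the partition property noted above.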

We define the width of a tree as the greatest number of nodes in any level:

$$\mathrm{wid}(T) = \sup_{1 \le d \le \mathrm{dep}(T)} |\mathrm{lev}_T(d)|. \tag{8}$$

We will consistently take $|V| = n$ for finite $V$. An ordering $J : V \to \mathbb{N}$ of the nodes is said to be breadth-first if

$$\mathrm{dep}_T(u) < \mathrm{dep}_T(v) \implies J(u) < J(v). \tag{9}$$

Since every directed tree $T = (V, E)$ has some breadth-first ordering,¹ we shall henceforth blur the distinction between $v \in V$ and $J(v)$, simply taking $V = [n]$ (or $V = \mathbb{N}$) and assuming that $\mathrm{dep}_T(u) < \mathrm{dep}_T(v) \implies u < v$ holds. This will allow us to write $S^V$ simply as $S^n$ for any set $S$.

Note that we have two orders on $V$: the partial order $\prec_T$, induced by the tree topology, and the total order $<$, given by the breadth-first enumeration. Observe that $i \prec_T j$ implies $i < j$ but not the other way around.

If $T = (V, E)$ is a tree and $u \in V$, we define the subtree induced by $u$, $T_u = (V_u, E_u)$, by $V_u = \{v \in V : u \preceq_T v\}$, $E_u = \{(v, w) \in E : v, w \in V_u\}$.

2.2.2 Markov tree measure

If $S$ is a finite set, a Markov tree measure $\mu$ is defined on $S^n$ by a tree $T = (V, E)$ and transition kernels $p_0$, $\{p_{ij}(\cdot \mid \cdot)\}_{(i,j) \in E}$. Continuing our convention in §2.2.1, we have a breadth-first order $<$ and the partial order $\prec_T$ on $V$, and take $V = \{1, \ldots, n\}$. Together, the topology of $T$ and the transition kernels determine the measure $\mu$ on $S^n$:

$$\mu(x) = p_0(x_1) \prod_{(i,j) \in E} p_{ij}(x_j \mid x_i). \tag{10}$$

A measure on $S^n$ satisfying (10) for some $T$ and $\{p_{ij}\}$ is said to be compatible with tree $T$; a measure is a Markov tree measure if it is compatible with some tree.

Suppose $S$ is a finite set and $(X_i)_{i \in \mathbb{N}}$, $X_i \in S$, is a random process defined on $(S^{\mathbb{N}}, P)$. If for each $n > 0$ there is a tree $T^{(n)} = ([n], E^{(n)})$ and a Markov tree measure $\mu_n$ compatible with $T^{(n)}$ such that for all $x \in S^n$ we have

$$P\{X_1^n = x\} = \mu_n(x),$$

then we call $X$ a Markov tree process. The trees $\{T^{(n)}\}$ are easily seen to be consistent in the sense that $T^{(n)}$ is an induced subgraph of $T^{(n+1)}$. So corresponding to any Markov tree process is the unique infinite tree $T = (\mathbb{N}, E)$. The uniqueness of $T$ is easy to see, since for $v > 1$, the parent of $v$ is the smallest $u \in \mathbb{N}$ such that

$$P\{X_v = x_v \mid X_1^u = x_1^u\} = P\{X_v = x_v \mid X_u = x_u\};$$

thus $P$ determines the topology of $T$.

It is straightforward to verify that a Markov tree process $\{X_v\}_{v \in T}$ compatible with tree $T$ has the following Markov property: if $v$ and $v'$ are children of $u$ in $T$, then

$$P\{X_{T_v} = x, X_{T_{v'}} = x' \mid X_u = y\} = P\{X_{T_v} = x \mid X_u = y\} \, P\{X_{T_{v'}} = x' \mid X_u = y\}.$$

In other words, the subtrees induced by the children are conditionally independent given the parent; this follows directly from the definition of the Markov tree measure in (10).

¹ One can easily construct a breadth-first ordering on a given tree by ordering the nodes arbitrarily within each level and listing the levels in ascending order: $\mathrm{lev}_T(0), \mathrm{lev}_T(1), \ldots$.
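As a sanity check on the definition in (10) and the Markov property just stated, the following sketch builds a small Markov tree measure by brute force. The tree, the state space $S = \{0, 1\}$, and all numerical kernel values are hypothetical choices made for illustration; the names `mu`, `prob`, and `kern` are likewise assumptions.

```python
import itertools

S = (0, 1)                           # finite state space (illustrative)
n = 5
# Hypothetical tree on V = {1,...,5}, numbered breadth-first: child -> parent.
parent = {2: 1, 3: 1, 4: 2, 5: 3}
p0 = [0.3, 0.7]                      # initial distribution p_0
# kern[(u, v)][y][x] = p_uv(x | y); the values are arbitrary.
kern = {
    (1, 2): [[0.9, 0.1], [0.4, 0.6]],
    (1, 3): [[0.2, 0.8], [0.5, 0.5]],
    (2, 4): [[0.7, 0.3], [0.1, 0.9]],
    (3, 5): [[0.6, 0.4], [0.3, 0.7]],
}

def mu(x):
    """Evaluate (10): mu(x) = p_0(x_1) * prod over edges of p_uv(x_v | x_u)."""
    prob_x = p0[x[0]]
    for v, u in parent.items():
        prob_x *= kern[(u, v)][x[u - 1]][x[v - 1]]
    return prob_x

def prob(fixed):
    """P{X_v = fixed[v] for all v in fixed}, by summing mu over S^n."""
    return sum(mu(x) for x in itertools.product(S, repeat=n)
               if all(x[v - 1] == s for v, s in fixed.items()))

total = prob({})                     # should equal 1

# Subtrees T_2 = {2, 4} and T_3 = {3, 5} should be independent given X_1.
max_gap = 0.0
for y, a2, a4, b3, b5 in itertools.product(S, repeat=5):
    base = prob({1: y})
    joint = prob({1: y, 2: a2, 4: a4, 3: b3, 5: b5}) / base
    left = prob({1: y, 2: a2, 4: a4}) / base
    right = prob({1: y, 3: b3, 5: b5}) / base
    max_gap = max(max_gap, abs(joint - left * right))
```

Here `total` checks that (10) defines a probability measure, and `max_gap` measures the largest deviation from the factorization of the two child subtrees given the root; both checks hold up to floating-point error.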

2.3 Statement of result

Theorem 2.1. Let $S$ be a finite set and let $(X_i)_{1 \le i \le n}$, $X_i \in S$, be a Markov tree process, defined by a tree $T = (V, E)$ and transition kernels $p_0$, $\{p_{uv}(\cdot \mid \cdot)\}_{(u,v) \in E}$. Define the $(u, v)$-contraction coefficient $\theta_{uv}$ by

$$\theta_{uv} = \sup_{y, y' \in S} \left\| p_{uv}(\cdot \mid y) - p_{uv}(\cdot \mid y') \right\|_{TV}. \tag{11}$$

Suppose $\max_{(u,v) \in E} \theta_{uv} \le \theta < 1$ for some $\theta$ and $\mathrm{wid}(T) \le L$. Then for the Markov tree process $X$ we have

$$\eta_{ij} \le \left( 1 - (1 - \theta)^L \right)^{\lfloor (j-i)/L \rfloor} \tag{12}$$

for $1 \le i < j \le n$.

To cast (12) in more usable form, we first note that for $L \in \mathbb{N}$ and $k \in \mathbb{N}$, if $k \ge L$ then

$$\left\lfloor \frac{k}{L} \right\rfloor \ge \frac{k}{2L - 1} \tag{13}$$

(we omit the elementary number-theoretic proof). Using (13), we have

$$\eta_{ij} \le \tilde\theta^{\,j-i}, \quad \text{for } j \ge i + L, \tag{14}$$

where $\tilde\theta = \left( 1 - (1 - \theta)^L \right)^{1/(2L-1)}$.

The bounds in (3) and (4) are for different metric spaces and therefore not readily comparable (the result in (3) has the additional convexity assumption). For the case where (14) holds, Samson's bound [6] yields

$$\|\Gamma\|_2 \lesssim \frac{1}{1 - \tilde\theta^{1/2}}, \tag{15}$$

and the approximation

$$\|\Delta\|_\infty \lesssim \sum_{k=0}^{\infty} \tilde\theta^{\,k} = \frac{1}{1 - \tilde\theta} \tag{16}$$

holds trivially via (5).²

In the (degenerate) case where the Markov tree is a chain, we have $L = 1$ and therefore $\tilde\theta = \theta$; thus we recover the Markov chain concentration results in [3, 4, 6], and the approximations in (15) and (16) become precise inequalities.

2.4 Proof of main result

The proof of Theorem 2.1 is a combination of elementary graph theory and tensor algebra. We start with a graph-theoretic lemma:

² The statement is approximate because (14) does not hold for all $j > i$ but only starting with $j \ge i + L$. The difference between $\left( 1 - (1 - \theta)^L \right)^{\lfloor (j-i)/L \rfloor} = 1$ and $\tilde\theta^{\,j-i}$ for $i < j < i + L$ is at most $1 - \tilde\theta^{\,L-1}$ and affects only a fixed finite number ($L - 1$) of entries in each row of $\Gamma$ and $\Delta$. Since $\|\cdot\|_2$ and $\|\cdot\|_\infty$ are continuous functionals, we are justified in claiming the approximate bound, which may be quantified if an application calls for it. The statements in (15) and (16) are only meant to convey an order of magnitude.
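The quantities in Theorem 2.1 are easy to evaluate numerically. The sketch below computes the contraction coefficient (11) for one kernel, the bound (12), and $\tilde\theta$ from (14), and spot-checks inequality (13) over a finite range. The function names and the example kernel are assumptions for illustration, and the finite-range check is of course not a proof of (13).

```python
import math

def tv(p, q):
    """Total variation distance between two probability vectors, as in (7)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def contraction(kernel):
    """Contraction coefficient (11); kernel[y] is the distribution p(. | y)."""
    return max(tv(r, s) for r in kernel for s in kernel)

def eta_bound(theta, L, i, j):
    """Right-hand side of (12)."""
    return (1 - (1 - theta) ** L) ** ((j - i) // L)

def theta_tilde(theta, L):
    """The constant defined after (14)."""
    return (1 - (1 - theta) ** L) ** (1 / (2 * L - 1))

kernel = [[0.9, 0.1], [0.4, 0.6]]    # hypothetical kernel on S = {0, 1}
theta = contraction(kernel)          # 0.5 up to rounding

# Spot-check (13) in integer form: floor(k/L) * (2L - 1) >= k whenever k >= L.
ok_13 = all((k // L) * (2 * L - 1) >= k
            for L in range(1, 21) for k in range(L, 501))

# Spot-check that (12) implies (14) once j - i >= L.
ok_14 = all(eta_bound(theta, L, 1, 1 + k) <= theta_tilde(theta, L) ** k + 1e-12
            for L in range(1, 6) for k in range(L, 60))

chain_case = theta_tilde(theta, 1)   # L = 1 gives back theta
```

With $L = 1$ the exponent $1/(2L-1)$ equals 1, so `chain_case` coincides with `theta`, matching the remark that the chain case recovers the Markov chain bound.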

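Lemma 2.2. Let $T = ([n], E)$ be a tree and fix $1 \le i < j \le n$. Suppose $(X_i)_{1 \le i \le n}$ is a Markov tree process whose law $P$ on $S^n$ is compatible with $T$. Define the set

$$T_i^j = T_i \cap \{j, j+1, \ldots, n\},$$

consisting of those nodes in the subtree $T_i$ whose breadth-first numbering does not precede $j$. Then, for $y \in S^{i-1}$ and $w, w' \in S$, we have

$$\eta_{ij}(y, w, w') = \begin{cases} 0 & T_i^j = \emptyset \\ \eta_{ij_0}(y, w, w') & \text{otherwise}, \end{cases} \tag{17}$$

where $j_0$ is the minimum (with respect to $<$) element of $T_i^j$.

Remark 2.3. This lemma tells us that when computing $\eta_{ij}$ it is sufficient to restrict our attention to the subtree induced by $i$.

Proof. The case $j \in T_i$ implies $j_0 = j$ and is trivial; thus we assume $j \notin T_i$. In this case, the subtrees $T_i$ and $T_j$ are disjoint. Putting $\bar T_i = T_i \setminus \{i\}$, we have by the Markov property,

$$P\{X_{\bar T_i} = x_{\bar T_i}, X_{T_j} = x_{T_j} \mid X_1^i = [y\, w]\} = P\{X_{\bar T_i} = x_{\bar T_i} \mid X_i = w\} \, P\{X_{T_j} = x_{T_j} \mid X_1^{i-1} = y\}.$$

Then from (2) and (7), and by marginalizing out the $X_{T_j}$, we have

$$\eta_{ij}(y, w, w') = \tfrac{1}{2} \sum_{x_j^n} \left| P\{X_j^n = x_j^n \mid X_1^i = [y\, w]\} - P\{X_j^n = x_j^n \mid X_1^i = [y\, w']\} \right|$$

$$= \tfrac{1}{2} \sum_{x_{T_i^j}} \left| P\{X_{T_i^j} = x_{T_i^j} \mid X_i = w\} - P\{X_{T_i^j} = x_{T_i^j} \mid X_i = w'\} \right|.$$

If $T_i^j = \emptyset$ then obviously $\eta_{ij} = 0$; otherwise, $\eta_{ij} = \eta_{ij_0}$, since $j_0$ is the first element of $T_i^j$.

Next we develop some basic results for tensor norms; recall that unless specified otherwise, the norm used in this paper is the total variation norm defined in (7). If $A$ is an $M \times N$ column-stochastic matrix ($A_{ij} \ge 0$ for $1 \le i \le M$, $1 \le j \le N$ and $\sum_{i=1}^{M} A_{ij} = 1$ for all $1 \le j \le N$) and $u \in \mathbb{R}^N$ is balanced in the sense that $\sum_{j=1}^{N} u_j = 0$, we have, by the contraction lemma in [3],

$$\|Au\| \le \|A\| \, \|u\|, \tag{18}$$

where

$$\|A\| = \max_{1 \le j, j' \le N} \left\| A_{*,j} - A_{*,j'} \right\|, \tag{19}$$

and $A_{*,j}$ denotes the $j$th column of $A$. An immediate consequence of (18) is that $\|\cdot\|$ satisfies

$$\|AB\| \le \|A\| \, \|B\| \tag{20}$$

for column-stochastic matrices $A \in \mathbb{R}^{M \times N}$ and $B \in \mathbb{R}^{N \times P}$.

Remark 2.4. Note that if $A$ is a column-stochastic matrix then $\|A\| \le 1$, and if additionally $u$ is balanced then $Au$ is also balanced.

If $u \in \mathbb{R}^M$ and $v \in \mathbb{R}^N$, define their tensor product $w = v \otimes u$ by

$$w_{(i,j)} = u_i v_j,$$

where the notation $(v \otimes u)_{(i,j)}$ is used to distinguish the 2-tensor $w$ from an $M \times N$ matrix. The tensor $w$ is a vector in $\mathbb{R}^{MN}$ indexed by pairs $(i, j) \in [M] \times [N]$; its norm is naturally defined to be

$$\|w\| = \tfrac{1}{2} \sum_{(i,j) \in [M] \times [N]} \left| w_{(i,j)} \right|. \tag{21}$$

The following result will play a key role in deriving our bound (we suppress the boldfaced vector notation for readability):

Lemma 2.5. Consider two finite sets $\mathcal{X}$, $\mathcal{Y}$, with probability measures $p, p'$ on $\mathcal{X}$ and $q, q'$ on $\mathcal{Y}$. Then

$$\|p \otimes q - p' \otimes q'\| \le \|p - p'\| + \|q - q'\| - \|p - p'\| \, \|q - q'\|. \tag{22}$$

Remark 2.6. Note that $p \otimes q$ is a 2-tensor in $\mathbb{R}^{\mathcal{X} \times \mathcal{Y}}$ and a probability measure on $\mathcal{X} \times \mathcal{Y}$.

Proof. Fix $q, q'$ and define the function

$$F(u, v) = \sum_{x \in \mathcal{X}} |u_x - v_x| + \|q - q'\| \left( 2 - \sum_{x \in \mathcal{X}} |u_x - v_x| \right) - \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \left| u_x q_y - v_x q'_y \right|$$

over the convex polytope $U \subset \mathbb{R}^{\mathcal{X}} \times \mathbb{R}^{\mathcal{X}}$,

$$U = \left\{ (u, v) : u_x, v_x \ge 0, \ \sum u_x = \sum v_x = 1 \right\};$$

note that proving the claim is equivalent to showing that $F \ge 0$ on $U$. For any $\sigma \in \{-1, +1\}^{\mathcal{X}}$, let

$$U_\sigma = \{(u, v) \in U : \mathrm{sgn}(u_x - v_x) = \sigma_x\};$$

note that $U_\sigma$ is a convex polytope and that $U = \bigcup_{\sigma \in \{-1,+1\}^{\mathcal{X}}} U_\sigma$.³ Pick an arbitrary $\tau \in \{-1, +1\}^{\mathcal{X} \times \mathcal{Y}}$ and define

$$F_\sigma(u, v) = \sum_x \sigma_x (u_x - v_x) + \|q - q'\| \left( 2 - \sum_x \sigma_x (u_x - v_x) \right) - \sum_{x,y} \tau_{xy} (u_x q_y - v_x q'_y)$$

over $U_\sigma$. Since $\sigma_x (u_x - v_x) = |u_x - v_x|$ and $\tau_{xy}$ can be chosen (for any given $u, v, q, q'$) so that $\tau_{xy} (u_x q_y - v_x q'_y) = |u_x q_y - v_x q'_y|$, the claim that $F \ge 0$ on $U$ will follow if we can show that $F_\sigma \ge 0$ on $U_\sigma$.

Observe that $F_\sigma$ is affine in its arguments $(u, v)$ and recall that an affine function achieves its extreme values on the extreme points of a convex domain. Thus to verify that $F_\sigma \ge 0$ on $U_\sigma$, we need only check the value of $F_\sigma$ on the extreme points of $U_\sigma$. The extreme points of $U_\sigma$ are pairs $(u, v)$ such that, for some $x', x'' \in \mathcal{X}$, $u = \delta(x')$ and $v = \delta(x'')$, where $\delta(x_0) \in \mathbb{R}^{\mathcal{X}}$ is given by $[\delta(x_0)]_x = \mathbb{1}_{\{x = x_0\}}$. Let $(\hat u, \hat v)$ be an extreme point of $U_\sigma$. The case $\hat u = \hat v$ is trivial, so assume $\hat u \ne \hat v$. In this case, $\sum_{x \in \mathcal{X}} \sigma_x (\hat u_x - \hat v_x) = 2$ and

$$\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \tau_{xy} (\hat u_x q_y - \hat v_x q'_y) \le \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \left| \hat u_x q_y - \hat v_x q'_y \right| \le 2.$$

This shows that $F_\sigma \ge 0$ on $U_\sigma$ and completes the proof.

³ We define $\mathrm{sgn}(z) = \mathbb{1}_{\{z \ge 0\}} - \mathbb{1}_{\{z < 0\}}$. Note that the constraint $\sum_{x \in \mathcal{X}} u_x = \sum_{x \in \mathcal{X}} v_x = 1$ forces $U_\sigma = \{(u, v) \in U : u_x = v_x\}$ when $\sigma_x = +1$ for all $x \in \mathcal{X}$ and $U_\sigma = \emptyset$ when $\sigma_x = -1$ for all $x \in \mathcal{X}$. Both of these cases are trivial.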
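Inequality (22) can be spot-checked numerically on random distributions. The sketch below is only a sanity check under illustrative assumptions (the set sizes 3 and 4, the seed, and the helper names are all arbitrary choices), and it does not replace the proof.

```python
import random

def tv(p, q):
    """Total variation norm of p - q, i.e. half the l1 distance (cf. (7), (21))."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def random_dist(k, rng):
    """A random probability vector of length k (illustrative helper)."""
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def tensor(p, q):
    """Product measure p (x) q, flattened into one list over pairs (x, y)."""
    return [a * b for a in p for b in q]

rng = random.Random(0)
worst = float("-inf")   # largest observed value of LHS - RHS in (22)
for _ in range(2000):
    p, p2 = random_dist(3, rng), random_dist(3, rng)
    q, q2 = random_dist(4, rng), random_dist(4, rng)
    lhs = tv(tensor(p, q), tensor(p2, q2))
    a, b = tv(p, p2), tv(q, q2)
    rhs = a + b - a * b
    worst = max(worst, lhs - rhs)

# Point masses on distinct atoms make both sides of (22) equal to 1.
tight = tv(tensor([1, 0], [1, 0]), tensor([0, 1], [0, 1]))
```

If (22) holds, `worst` never exceeds zero beyond floating-point rounding. The `tight` example shows that the bound is attained when the two measures on $\mathcal{X}$ are mutually singular.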

To develop a convenient tensor notation, we will fix the index set $V = \{1, \ldots, n\}$. For $I \subset V$, a tensor indexed by $I$ is a vector $u \in \mathbb{R}^{S^I}$. A special case of such an $I$-tensor is the product $u = \bigotimes_{i \in I} v^{(i)}$, where $v^{(i)} \in \mathbb{R}^S$ and

$$u_{x_I} = \prod_{i \in I} \left( v^{(i)} \right)_{x_i}.$$

To gain more familiarity with the notation, let us write the total variation norm of an $I$-tensor:

$$\|u\| = \tfrac{1}{2} \sum_{x_I \in S^I} \left| u_{x_I} \right|.$$

In order to extend Lemma 2.5 to product tensors, we will need to define the function $\alpha_k : \mathbb{R}^k \to \mathbb{R}$ and state some of its properties:

Lemma 2.7. Define $\alpha_k : \mathbb{R}^k \to \mathbb{R}$ recursively as $\alpha_1(x) = x$ and

$$\alpha_{k+1}(x_1, x_2, \ldots, x_{k+1}) = x_{k+1} + (1 - x_{k+1}) \, \alpha_k(x_1, x_2, \ldots, x_k). \tag{23}$$

Then

(a) $\alpha_k$ is symmetric in its $k$ arguments, so it is well-defined as a mapping from finite real sets to the reals, $\alpha : \{x_i : 1 \le i \le k\} \to \mathbb{R}$

(b) $\alpha_k$ takes $[0, 1]^k$ to $[0, 1]$ and is monotonically increasing in each argument on $[0, 1]^k$

(c) If $B \subset C \subset [0, 1]$ then $\alpha(B) \le \alpha(C)$

(d) $\alpha_k(x, x, \ldots, x) = 1 - (1 - x)^k$

(e) if $1 \in B \subset [0, 1]$ then $\alpha(B) = 1$.

Remark 2.8. In light of (a), we will use the notation $\alpha_k(x_1, x_2, \ldots, x_k)$ and $\alpha(\{x_i : 1 \le i \le k\})$ interchangeably, as dictated by convenience.

Proof. Claims (a), (b), (e) are straightforward to verify from the recursive definition of $\alpha$ and induction. Claim (c) follows from (b) since

$$\alpha_{k+1}(x_1, x_2, \ldots, x_k, 0) = \alpha_k(x_1, x_2, \ldots, x_k),$$

and (d) is easily derived from the binomial expansion of $(1 - x)^k$.

The function $\alpha_k$ is the natural generalization of $\alpha_2(x_1, x_2) = x_1 + x_2 - x_1 x_2$ to $k$ variables, and it is what we need for the analogue of Lemma 2.5 for a product of $k$ tensors:

Corollary 2.9. Let $\{u^{(i)}\}_{i \in I}$ and $\{v^{(i)}\}_{i \in I}$ be two sets of tensors and assume that each of $u^{(i)}, v^{(i)}$ is a probability measure on $S$. Then we have

$$\left\| \bigotimes_{i \in I} u^{(i)} - \bigotimes_{i \in I} v^{(i)} \right\| \le \alpha \left\{ \left\| u^{(i)} - v^{(i)} \right\| : i \in I \right\}. \tag{24}$$

Proof. Pick an $i_0 \in I$ and let $p = u^{(i_0)}$, $q = \bigotimes_{i_0 \ne i \in I} u^{(i)}$, $p' = v^{(i_0)}$, $q' = \bigotimes_{i_0 \ne i \in I} v^{(i)}$. Apply Lemma 2.5 to $\|p \otimes q - p' \otimes q'\|$ and proceed by induction.
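The recursion (23) is short enough to implement directly. The sketch below evaluates $\alpha_k$ and checks properties (a), (c), (d), and (e) of Lemma 2.7 on sample inputs. It also compares against the closed form $1 - \prod_i (1 - x_i)$, which is not stated in the text but follows from (23) by induction, since $1 - \alpha_{k+1} = (1 - x_{k+1})(1 - \alpha_k)$. The function name `alpha` and the sample values are illustrative assumptions.

```python
from itertools import permutations
from math import isclose, prod

def alpha(xs):
    """Evaluate alpha_k(x_1, ..., x_k) via the recursion (23); xs is non-empty."""
    xs = list(xs)
    a = xs[0]                      # alpha_1(x) = x
    for x in xs[1:]:
        a = x + (1 - x) * a        # alpha_{k+1} = x_{k+1} + (1 - x_{k+1}) * alpha_k
    return a

xs = [0.2, 0.5, 0.7]               # arbitrary sample point in [0, 1]^3

# Closed form implied by the recursion (an observation, not from the paper).
closed_form = 1 - prod(1 - x for x in xs)

# (a) symmetry: every ordering of the arguments gives the same value.
symmetric = all(isclose(alpha(perm), alpha(xs)) for perm in permutations(xs))

# (c) monotone under adding arguments: alpha(B) <= alpha(C) when B is a subset of C.
monotone = alpha(xs[:2]) <= alpha(xs)

# (d) equal arguments: alpha_k(x, ..., x) = 1 - (1 - x)^k.
equal_args = isclose(alpha([0.3] * 4), 1 - (1 - 0.3) ** 4)

# (e) an argument equal to 1 forces alpha = 1.
saturated = isclose(alpha([0.2, 1.0, 0.6]), 1.0)
```

Property (d) is the one used later, with $k \le L$ and every argument at most $\theta$, to obtain the factor $1 - (1 - \theta)^L$.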

Our final generalization concerns linear operators over $I$-tensors. An $I, J$-matrix $A$ has dimensions $|S^J| \times |S^I|$ and takes an $I$-tensor $u$ to a $J$-tensor $v$: for each $x_J \in S^J$, we have

$$v_{x_J} = \sum_{x_I \in S^I} A_{x_J, x_I} u_{x_I}, \tag{25}$$

which we write as $Au = v$. If $A$ is an $I, J$-matrix and $B$ is a $J, K$-matrix, the matrix product $BA$ is defined analogously to (25).

As a special case, an $I, J$-matrix might factorize as a tensor product of $|S| \times |S|$ matrices $A^{(i,j)} \in \mathbb{R}^{S \times S}$. We will write such a factorization in terms of a bipartite graph $G = (I + J, E)$, where $E \subset I \times J$ and the factors $A^{(i,j)}$ are indexed by $(i, j) \in E$:

$$A = \bigotimes_{(i,j) \in E} A^{(i,j)}, \tag{26}$$

where

$$A_{x_J, x_I} = \prod_{(i,j) \in E} A^{(i,j)}_{x_j, x_i}$$

for all $x_I \in S^I$ and $x_J \in S^J$. The norm of an $I, J$-matrix is a natural generalization of the matrix norm defined in (19):

$$\|A\| = \max_{x_I, x'_I \in S^I} \left\| A_{*,x_I} - A_{*,x'_I} \right\|, \tag{27}$$

where $A_{*,x_I}$ is the $J$-tensor given by $u_{x_J} = A_{x_J, x_I}$; (27) is well-defined via the tensor norm in (21). Since $I, J$-matrices act on $I$-tensors by ordinary matrix multiplication, $\|Au\| \le \|A\| \, \|u\|$ continues to hold when $A$ is a column-stochastic $I, J$-matrix and $u$ is a balanced $I$-tensor; if, additionally, $B$ is a column-stochastic $J, K$-matrix, $\|BA\| \le \|B\| \, \|A\|$ also holds. Likewise, since another way of writing (26) is

$$A_{*,x_I} = \bigotimes_{(i,j) \in E} A^{(i,j)}_{*,x_i},$$

Corollary 2.9 extends to tensor products of matrices:

Lemma 2.10. Fix index sets $I, J$ and a bipartite graph $(I + J, E)$. Let $\{A^{(i,j)}\}_{(i,j) \in E}$ be a collection of column-stochastic $|S| \times |S|$ matrices, whose tensor product is the $I, J$-matrix

$$A = \bigotimes_{(i,j) \in E} A^{(i,j)}.$$

Then

$$\|A\| \le \alpha \left\{ \left\| A^{(i,j)} \right\| : (i, j) \in E \right\}.$$

We are now in a position to state the main technical lemma, from which Theorem 2.1 will follow straightforwardly:

Lemma 2.11. Let $S$ be a finite set and let $(X_i)_{1 \le i \le n}$, $X_i \in S$, be a Markov tree process, defined by a tree $T = (V, E)$ and transition kernels $p_0$, $\{p_{uv}(\cdot \mid \cdot)\}_{(u,v) \in E}$. Let the $(u, v)$-contraction coefficient $\theta_{uv}$ be as defined in (11). Fix $1 \le i < j \le n$ and let $j_0 = j_0(i, j)$ be as defined in Lemma 2.2 (we are assuming its existence, for otherwise $\eta_{ij} = 0$). Then we have

$$\eta_{ij} \le \prod_{d = \mathrm{dep}_T(i) + 1}^{\mathrm{dep}_T(j_0)} \alpha \left\{ \theta_{uv} : v \in \mathrm{lev}_T(d) \right\}. \tag{28}$$

Proof. For $y \in S^{i-1}$ and $w, w' \in S$, we have

$$\eta_{ij}(y, w, w') = \tfrac{1}{2} \sum_{x_j^n} \left| P\{X_j^n = x_j^n \mid X_1^i = [y\, w]\} - P\{X_j^n = x_j^n \mid X_1^i = [y\, w']\} \right| \tag{29}$$

$$= \tfrac{1}{2} \sum_{x_j^n} \left| \sum_{z_{i+1}^{j-1}} \left( P\{X_{i+1}^n = [z_{i+1}^{j-1}\, x_j^n] \mid X_1^i = [y\, w]\} - P\{X_{i+1}^n = [z_{i+1}^{j-1}\, x_j^n] \mid X_1^i = [y\, w']\} \right) \right|. \tag{30}$$

Let $T_i$ be the subtree induced by $i$ and

$$Z = T_i \cap \{i+1, \ldots, j_0 - 1\} \quad \text{and} \quad C = \{v \in T_i : (u, v) \in E, \ u < j_0, \ v \ge j_0\}. \tag{31}$$

Then by Lemma 2.2 and the Markov property, we get

$$\eta_{ij}(y, w, w') = \tfrac{1}{2} \sum_{x_C} \left| \sum_{x_Z} \left( P\{X_{C \cup Z} = x_{C \cup Z} \mid X_i = w\} - P\{X_{C \cup Z} = x_{C \cup Z} \mid X_i = w'\} \right) \right| \tag{32}$$

(the sum indexed by $\{j_0, \ldots, n\} \setminus C$ marginalizes out).

Define $D = \{d_k : k = 0, \ldots, |D|\}$ with $d_0 = \mathrm{dep}_T(i)$, $d_{|D|} = \mathrm{dep}_T(j_0)$ and $d_{k+1} = d_k + 1$ for $0 \le k < |D|$. For $d \in D$, let $I_d = T_i \cap \mathrm{lev}_T(d)$ and let $G_d = (I_{d-1} + I_d, E_d)$ be the bipartite graph consisting of the nodes in $I_{d-1}$ and $I_d$, and the edges in $E$ joining them (note that $I_{d_0} = \{i\}$).

For $(u, v) \in E$, let $A^{(u,v)}$ be the $|S| \times |S|$ matrix given by $A^{(u,v)}_{x,x'} = p_{uv}(x \mid x')$ and note that $\|A^{(u,v)}\| = \theta_{uv}$. Then by the Markov property, for each $x_{I_d} \in S^{I_d}$ and $x_{I_{d-1}} \in S^{I_{d-1}}$, $d \in D \setminus \{d_0\}$, we have

$$P\{X_{I_d} = x_{I_d} \mid X_{I_{d-1}} = x_{I_{d-1}}\} = A^{(d)}_{x_{I_d}, x_{I_{d-1}}},$$

where

$$A^{(d)} = \bigotimes_{(u,v) \in E_d} A^{(u,v)}.$$

Likewise, for $d \in D \setminus \{d_0\}$,

$$P\{X_{I_d} = x_{I_d} \mid X_i = w\} = \sum_{x_{I_{d_1}}} \sum_{x_{I_{d_2}}} \cdots \sum_{x_{I_{d-1}}} P\{X_{I_{d_1}} = x_{I_{d_1}} \mid X_i = w\} \, P\{X_{I_{d_2}} = x_{I_{d_2}} \mid X_{I_{d_1}} = x_{I_{d_1}}\} \cdots P\{X_{I_d} = x_{I_d} \mid X_{I_{d-1}} = x_{I_{d-1}}\}$$

$$= \left( A^{(d)} A^{(d-1)} \cdots A^{(d_1)} \right)_{x_{I_d}, w}. \tag{33}$$

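Define the (balanced) $I_{d_1}$-tensor

$$h = A^{(d_1)}_{*,w} - A^{(d_1)}_{*,w'}, \tag{34}$$

the $I_{d_{|D|}}$-tensor

$$f = A^{(d_{|D|})} A^{(d_{|D|-1})} \cdots A^{(d_2)} h, \tag{35}$$

and $C_0, C_1, Z_0 \subset \{1, \ldots, n\}$:

$$C_0 = C \cap I_{\mathrm{dep}_T(j_0)}, \quad C_1 = C \setminus C_0, \quad Z_0 = I_{\mathrm{dep}_T(j_0)} \setminus C_0, \tag{36}$$

where $C$ and $Z$ are defined in (31). For readability we will write $p(x_U \mid \cdot)$ instead of $P\{X_U = x_U \mid \cdot\}$ below; no ambiguity should arise. Combining (32) and (33), we have

$$\eta_{ij}(y, w, w') = \tfrac{1}{2} \sum_{x_C} \left| \sum_{x_Z} \left( p(x_{C \cup Z} \mid X_i = w) - p(x_{C \cup Z} \mid X_i = w') \right) \right| \tag{37}$$

$$= \tfrac{1}{2} \sum_{x_{C_0}} \sum_{x_{C_1}} \left| \sum_{x_{Z_0}} p(x_{C_1} \mid x_{Z_0}) \, f_{x_{C_0 \cup Z_0}} \right| \tag{38}$$

$$= \|Bf\|, \tag{39}$$

where $B$ is the $|S^{C_0 \cup C_1}| \times |S^{C_0 \cup Z_0}|$ column-stochastic matrix given by

$$B_{(x_{C_0} x_{C_1}), (x'_{C_0} x_{Z_0})} = \mathbb{1}_{\{x_{C_0} = x'_{C_0}\}} \, p(x_{C_1} \mid x_{Z_0}),$$

with the convention that $p(x_{C_1} \mid x_{Z_0}) = 1$ if either of $Z_0$, $C_1$ is empty. The claim now follows by reading off the results previously obtained:

$$\begin{aligned}
\|Bf\| &\le \|B\| \, \|f\| && \text{Eq. (18)} \\
&\le \|f\| && \text{Remark 2.4} \\
&\le \|h\| \prod_{k=2}^{|D|} \left\| A^{(d_k)} \right\| && \text{Eqs. (20, 35)} \\
&\le \prod_{k=1}^{|D|} \alpha \left\{ \left\| A^{(u,v)} \right\| : (u, v) \in E_{d_k} \right\} && \text{Lemma 2.10.}
\end{aligned}$$

Proof of Theorem 2.1. We will borrow the definitions from the proof of Lemma 2.11. To upper-bound $\eta_{ij}$ we first bound $\alpha \{ \|A^{(u,v)}\| : (u, v) \in E_{d_k} \}$. Since $|E_{d_k}| \le \mathrm{wid}(T) \le L$ (because every node in $I_{d_k}$ has exactly one parent in $I_{d_{k-1}}$) and $\|A^{(u,v)}\| = \theta_{uv} \le \theta < 1$, we appeal to Lemma 2.7 to obtain

$$\alpha \left\{ \left\| A^{(u,v)} \right\| : (u, v) \in E_{d_k} \right\} \le 1 - (1 - \theta)^L. \tag{40}$$

Now we must lower-bound the quantity $h = \mathrm{dep}_T(j_0) - \mathrm{dep}_T(i)$. Since every level can have up to $L$ nodes, the nodes numbered $i + 1, \ldots, j_0$ occupy at most $h + 1$ levels, one of which also contains $i$; hence $j_0 - i \le (h + 1)L - 1$ and so

$$h \ge \lfloor (j_0 - i)/L \rfloor \ge \lfloor (j - i)/L \rfloor.$$

Combining this with (28) and (40) yields (12).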
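The bound (12) can be compared against a brute-force computation of $\eta_{ij}$ on a small example. The sketch below does this for a hypothetical five-node tree of width $L = 2$ over $S = \{0, 1\}$ with randomly drawn, strictly positive kernels. The tree, the seed, and all helper names are assumptions made for illustration, and the exhaustive enumeration is only feasible for tiny $n$.

```python
import itertools
import random

S = (0, 1)
n = 5
# Hypothetical breadth-first-numbered tree, child -> parent; levels {1},{2,3},{4,5}.
parent = {2: 1, 3: 1, 4: 2, 5: 3}
L = 2                                   # wid(T) for this tree

rng = random.Random(1)

def random_row():
    """A strictly positive distribution on S, so all conditionals are defined."""
    a = rng.uniform(0.1, 0.9)
    return [a, 1 - a]

p0 = random_row()
# kern[(u, v)][y] = p_uv(. | y)
kern = {(u, v): [random_row(), random_row()] for v, u in parent.items()}

def mu(x):
    """Markov tree measure (10)."""
    prob_x = p0[x[0]]
    for v, u in parent.items():
        prob_x *= kern[(u, v)][x[u - 1]][x[v - 1]]
    return prob_x

def cond_law(i, j, prefix):
    """Law of X_j^n given X_1^i = prefix, by enumeration."""
    law = {}
    for rest in itertools.product(S, repeat=n - i):
        x = prefix + rest
        tail = x[j - 1:]
        law[tail] = law.get(tail, 0.0) + mu(x)
    z = sum(law.values())
    return {t: m / z for t, m in law.items()}

def eta(i, j):
    """eta_ij from (2): sup over y, w, w' of the TV distance between laws."""
    best = 0.0
    for y in itertools.product(S, repeat=i - 1):
        for w, w2 in itertools.product(S, repeat=2):
            a = cond_law(i, j, y + (w,))
            b = cond_law(i, j, y + (w2,))
            best = max(best, 0.5 * sum(abs(a[t] - b[t]) for t in a))
    return best

# For a two-state kernel, the TV distance between its two rows is |a - b|.
theta = max(abs(k[0][0] - k[1][0]) for k in kern.values())

def bound(i, j):
    """Right-hand side of (12)."""
    return (1 - (1 - theta) ** L) ** ((j - i) // L)

pairs = [(i, j) for i in range(1, n) for j in range(i + 1, n + 1)]
violations = [(i, j) for i, j in pairs if eta(i, j) > bound(i, j) + 1e-12]
eta_25 = eta(2, 5)                      # node 5 is outside the subtree of node 2
```

An empty `violations` list is consistent with Theorem 2.1 for this instance. The value `eta_25` illustrates the first case of Lemma 2.2: since $T_2 \cap \{5\} = \emptyset$, the coefficient vanishes up to rounding.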

The calculations in Lemma 2.11 yield considerably more information than the simple bound in (12). For example, suppose the tree $T$ has levels $\{I_d : d = 0, 1, \ldots\}$ with the property that the levels are growing at most linearly: $|I_d| \le cd$ for some $c > 0$. Let $d_i = \mathrm{dep}_T(i)$, $d_j = \mathrm{dep}_T(j_0)$, and $h = d_j - d_i$. Then

$$j - i \le j_0 - i \le c \sum_{k = d_i + 1}^{d_j} k = \frac{c}{2} \left( d_j (d_j + 1) - d_i (d_i + 1) \right) < \frac{c}{2} \left( (d_j + 1)^2 - d_i^2 \right) < \frac{c}{2} (d_i + h + 1)^2,$$

so

$$h > \sqrt{2(j - i)/c} - d_i - 1,$$

which yields the bound

$$\eta_{ij} \le \prod_{k=1}^{h} \left( 1 - (1 - \theta_k)^{ck} \right), \tag{41}$$

where $\theta_k \equiv \max\{\theta_{uv} : (u, v) \in E_k\}$. When $\theta_k$ is small ($ck\theta_k \le \theta < 1$), this becomes

$$\eta_{ij} < \prod_{k=1}^{h} (ck\theta_k) \tag{42}$$

$$\le \prod_{k=1}^{\sqrt{2(j-i)/c} - d_i - 1} (ck\theta_k) \tag{43}$$

$$\le \theta^{\sqrt{2(j-i)/c} - d_i - 1}. \tag{44}$$

This is a non-trivial bound for trees with linearly growing levels: recall that to bound $\|\Delta\|_\infty$ in (4) and (5), we must bound the series

$$\sum_{j = i+1}^{\infty} \eta_{ij}.$$

By the limit comparison test with the series $\sum_{j=1}^{\infty} 1/j^2$, we have that

$$\sum_{j = i+1}^{\infty} \theta^{\sqrt{2(j-i)/c} - d_i - 1}$$

converges for $\theta < 1$. Similar techniques may be applied when the level growth is bounded by other slowly increasing functions.

3 Discussion

We have presented a concentration of measure bound for Markov tree processes; to our knowledge, this is the first such result.⁴ In the simple case of the contracting, bounded-width Markov tree processes (i.e., those for which $\mathrm{wid}(T) \le L < \infty$ and $\sup_{u,v} \theta_{uv} \le \theta < 1$), the bound takes on a particularly tractable form (12), and in the degenerate case $L = 1$ it reduces to the sharpest known bound for Markov chains. The techniques we develop extend well beyond the somewhat restrictive contracting, bounded-width case, as demonstrated in the calculation leading to (44). The technical results in §2.4, particularly Lemma 2.5 and its generalizations, might be of independent interest. It is hoped that these techniques will be extended to obtain concentration bounds for larger classes of directed acyclic graphical models.

⁴ In a 2003 paper, Dembo et al. [1] presented large deviation bounds for typed Markov trees, which is a more general class of processes than the Markov tree processes defined here. The techniques used and bounds obtained in [1] are of a rather different flavor than here; this is not surprising since measure concentration and large deviations, while pursuing similar goals, tend to use different methods and state results that are often not immediately comparable.

Acknowledgements

I thank Kavita Ramanan for useful discussions.

References

[1] Amir Dembo, Peter Mörters, Scott Sheffield, A large-deviation theorem for tree-indexed Markov chains, 2003.

[2] Leonid Kontorovich, Measure Concentration of Hidden Markov Processes.

[3] Leonid Kontorovich and Kavita Ramanan, A concentration inequality for weakly contracting Markov chains. Paper in preparation, 2006.

[4] Katalin Marton, Bounding $\bar d$-distance by informational divergence: a method to prove measure concentration. Ann. Probab., Vol. 24, No. 2, 1996.

[5] Katalin Marton, A measure concentration inequality for contracting Markov chains. Geom. Funct. Anal., Vol. 6.

[6] Paul-Marie Samson, Concentration of measure inequalities for Markov chains and Φ-mixing processes. Ann. Probab., Vol. 28, No. 1, 2000.

Measure Concentration of Markov Tree Processes. arXiv:math/0608511v5 [math.PR], 1 Oct 2006. Leonid Kontorovich, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. lkontor@cs.cmu.edu