A note on stochastic context-free grammars, termination and the EM-algorithm


Niels Richard Hansen
Department of Applied Mathematics and Statistics, University of Copenhagen, Universitetsparken 5, 2100 Copenhagen Ø, Denmark

Abstract

Termination of a stochastic context-free grammar, i.e. almost sure finiteness of the random trees it produces, is shown to be equivalent to extinction of an embedded multitype branching process. We show that the maximum likelihood estimator in a saturated model based on complete or partial observation of a finite tree always gives terminating grammars. With partial observation we show that this in fact holds for the whole sequence of parameters obtained by the EM-algorithm. Finally, aspects of the size of the tree related to the embedded branching process are discussed.

Key words: EM-algorithm, maximum likelihood estimator, multitype branching process, stochastic context-free grammar

1 Introduction

A stochastic context-free grammar can be defined as a probability measure on a set of rooted trees. We give a formal definition in Section 2. This measure is specified by a set of rules for evolving symbols known as non-terminals into sequences of non-terminals and terminals (another set of symbols) and by probabilities assigned to these evolution rules. The purpose is to end up with a probability measure on the set of finite sequences of terminals. A modern application of stochastic context-free grammars can be found in the area of biological sequence analysis, and especially the modeling of RNA-molecules (Eddy & Durbin, 1994), (Sakakibara et al., 1994), (Durbin et al., 1998).

Email address: richard@math.ku.dk (Niels Richard Hansen). URL: www.math.ku.dk/~richard (Niels Richard Hansen).

We first show that one faces a termination problem in the sense that for stochastic context-free grammars the evolution can continue forever, i.e. the tree can be infinite with positive probability, and a resulting finite sequence of terminals is thus never obtained. We call the stochastic context-free grammar terminating if the tree is finite almost surely. This issue was first mentioned by Sankoff (1971). He remarks that a related branching process should be subcritical for the stochastic context-free grammar to be terminating, but this seems to have been neglected in the more recent literature.

The main purpose of this paper is to show that if we consider a class of saturated models and use maximum-likelihood estimation to infer the unknown parameters, the resulting stochastic context-free grammar will always be terminating. This holds whether we observe the entire tree or only the resulting finite sequence of terminals. In the latter case, where we have only partial observations, one will typically apply the EM-algorithm to find the maximum-likelihood estimate, and we actually show that the whole sequence of parameters obtained by running the EM-algorithm gives terminating stochastic context-free grammars. In the final section, Section 4, we discuss issues related to the distribution of the size of the tree, especially for parameters on the boundary between terminating and non-terminating stochastic context-free grammars.

2 Stochastic context-free grammars

Let $E$ and $A$ be two finite sets. We call elements in $E$ non-terminals and elements in $A$ terminals. We let $T$ denote the set of (possibly infinite) rooted trees with internal nodes from $E$ and leaves from $A$. Let $S$ denote the set of finite sequences from the disjoint union $E \cup A$, and let $P$ be an $E \times S$ stochastic matrix. If $\nu$ is a probability measure on $E$, then $P_\nu$ is the probability measure on $T$ defined recursively as follows: First the root, a non-terminal $x$, is drawn from $\nu$, and then, conditionally on $x$, the children of $x$, a sequence of terminals and non-terminals, are drawn from $P(x,\cdot)$. This is the first generation. If $x_1,\dots,x_m$ denote the non-terminal children obtained in the $n$'th generation, $m$ new sequences, forming the $(n+1)$'th generation, are drawn independently from $P(x_1,\cdot),\dots,P(x_m,\cdot)$, the children of $x_i$ being drawn from $P(x_i,\cdot)$.

Definition 1 The probability measure $P_\nu$ on $T$ defined above is called a stochastic context-free grammar. If $T_0 \subseteq T$ denotes the subset of finite trees, the stochastic context-free grammar is said to be terminating if $P_\nu(T_0) = 1$.
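To make the recursive construction above concrete, here is a small sketch (not from the paper) that represents the rule probabilities $P(x,\cdot)$ as a table and samples a tree by expanding non-terminals. The grammar, the names `RULES` and `TERMINALS`, and the node cap are illustrative assumptions; the cap is needed precisely because a non-terminating grammar may never produce a finite tree.

```python
import random
import sys

sys.setrecursionlimit(10_000)

# Hypothetical rule table: each non-terminal maps to a list of
# (right-hand side, probability) pairs; lower-case symbols are terminals.
RULES = {"S": [(("S", "S"), 0.6), (("a",), 0.4)]}
TERMINALS = {"a"}

def sample_tree(symbol, max_nodes=2_000):
    """Sample a derivation tree rooted at `symbol` as nested (symbol, children) pairs.

    Raises RuntimeError when more than `max_nodes` non-terminals have been
    expanded, a crude guard, since a non-terminating grammar may never stop.
    """
    budget = [max_nodes]

    def expand(sym):
        if sym in TERMINALS:
            return (sym, [])
        budget[0] -= 1
        if budget[0] < 0:
            raise RuntimeError("node budget exceeded; the grammar may not terminate")
        rhs = random.choices([r for r, _ in RULES[sym]],
                             weights=[p for _, p in RULES[sym]])[0]
        return (sym, [expand(child) for child in rhs])

    return expand(symbol)

def count_leaves(tree):
    _sym, children = tree
    return 1 if not children else sum(count_leaves(child) for child in children)

random.seed(1)
for _ in range(5):
    try:
        print("finite tree with", count_leaves(sample_tree("S")), "leaves")
    except RuntimeError as err:
        print(err)
```

With branching probability 0.6 the expected number of non-terminal children per non-terminal is 1.2, so by Theorem 3 below this illustrative grammar is not terminating, and a fraction of the runs will hit the node budget.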

Stochastic context-free grammars are used as a means to construct probability measures on the set $A^*$ of finite sequences of terminals from $A$ by a leaf traversal map, $f_{LT} : T_0 \to A^*$, defined as follows: If the tree $T = \{a\}$, for $a \in A$, consists only of a root terminal, then $f_{LT}(T) = a$. If the root of $T$ is a non-terminal, it has $m \geq 1$ children, each being the root of a subtree. Denoting these subtrees $T_1,\dots,T_m$, $f_{LT}$ is defined recursively by $f_{LT}(T) = f_{LT}(T_1),\dots,f_{LT}(T_m)$. This definition produces a sensible, finite sequence from $A$ only if the tree $T$ is finite; hence, for $f_{LT}(P_\nu)$ to be a well-defined probability measure on $A^*$ it is crucial that we restrict our attention to terminating stochastic context-free grammars.

Let $\Lambda$ denote the $E \times E$ matrix with $\Lambda_{x,y}$ being the expected number of non-terminals $y$ produced by the probability measure $P(x,\cdot)$. Let $\rho(\Lambda)$ denote the spectral radius of the matrix. Up to singularity (to be defined below), this spectral radius determines whether the stochastic context-free grammar is terminating.

Definition 2 A stochastic context-free grammar is said to be singular if there exists a set $E_0 \subseteq E$ such that for all $x \in E_0$ and all $\gamma \in S$ with $P(x,\gamma) > 0$ the sequence $\gamma$ contains exactly one state from $E_0$. The grammar is called non-singular if it is not singular.

Theorem 3 A non-singular stochastic context-free grammar is terminating if and only if $\rho(\Lambda) \leq 1$.

PROOF. Let $X_{n,x}$ denote the number of non-terminals $x$ in the $n$'th generation, and let $X_n = (X_{n,x})_{x \in E} \in \mathbb{N}_0^E$. Then $(X_n)_{n \geq 0}$ is a multitype branching process with the types being the set of non-terminals $E$ and $\Lambda$ the matrix of offspring expectations. Moreover, if the stochastic context-free grammar is non-singular, no subset of $E$ will reproduce exactly one non-terminal from that subset with probability one, and according to (Harris, 1963), Theorem II.10.1, this branching process becomes extinct with probability one if and only if $\rho(\Lambda) \leq 1$; see also Chapter 2 in (Mode, 1971) and Chapter 4 in (Jagers, 1975). Since termination of the stochastic context-free grammar is the same as extinction of the branching process, the result follows.
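Theorem 3 turns the termination question into a spectral-radius computation. The sketch below is illustrative: it reuses the rule-table representation from the previous sketch, assumes the grammar has already been checked to be non-singular, and builds the matrix $\Lambda$ of expected non-terminal offspring before testing $\rho(\Lambda) \leq 1$ numerically.

```python
import numpy as np

# Illustrative grammar with non-terminals "S", "T"; lower-case symbols are terminals.
RULES = {
    "S": [(("S", "T"), 0.5), (("b",), 0.5)],
    "T": [(("S",), 0.3), (("a",), 0.7)],
}

def offspring_matrix(rules):
    """Lambda[x, y] = expected number of non-terminals y among the children of x."""
    non_terminals = list(rules)
    idx = {x: i for i, x in enumerate(non_terminals)}
    lam = np.zeros((len(non_terminals), len(non_terminals)))
    for x, productions in rules.items():
        for rhs, prob in productions:
            for sym in rhs:
                if sym in idx:
                    lam[idx[x], idx[sym]] += prob
    return non_terminals, lam

def is_terminating(rules, tol=1e-12):
    """Termination test based on Theorem 3; assumes the grammar is non-singular."""
    _, lam = offspring_matrix(rules)
    return bool(max(abs(np.linalg.eigvals(lam))) <= 1.0 + tol)

non_terminals, lam = offspring_matrix(RULES)
print(non_terminals)
print(lam)
print("terminating:", is_terminating(RULES))
```

For this grammar the spectral radius is about 0.71, so the grammar is terminating.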

3 Maximum likelihood inference

We consider a parametric model for stochastic context-free grammars, where we a priori fix, for each non-terminal $x$, a finite set of sequences $\gamma \in S$ for which we allow $P(x,\gamma) > 0$. The saturated model consists of all $P$'s concordant with this requirement. We observe a single tree $T$ (complete observation) or some transformation $f(T)$ (partial observation) of the tree. The results generalize trivially to observing a number of i.i.d. trees.

3.1 With complete observation

For convenience assume from here on that $E = \{1,\dots,n\}$ and let $\Gamma \subseteq S$ be a finite set of sequences $\gamma_{ij}$ for $i = 1,\dots,n$ and $j = 1,\dots,m_i$, $m_i \geq 1$. We think of $\gamma_{i1},\dots,\gamma_{im_i}$ as the allowed sequences that can be drawn from non-terminal $i$. Denote by $n_{ijk}$ the number of times non-terminal $k$ occurs in $\gamma_{ij}$, and let

$$N_i = \begin{pmatrix} n_{i11} & \cdots & n_{i1n} \\ \vdots & & \vdots \\ n_{im_i1} & \cdots & n_{im_in} \end{pmatrix} \quad\text{and}\quad N = \begin{pmatrix} N_1 \\ \vdots \\ N_n \end{pmatrix}. \tag{1}$$

Thus $N_i$ is an $m_i \times n$ matrix and $N$ is a $(\sum_i m_i) \times n$ matrix. Let $p = (p_{ij})$ with $p_{ij} = P(i,\gamma_{ij})$ for $i = 1,\dots,n$, $j = 1,\dots,m_i$. Define

$$p_i = (p_{i1},\dots,p_{im_i}) \quad\text{and}\quad P = \mathrm{diag}(p_1,\dots,p_n). \tag{2}$$

Since $\Lambda_{ik} = \sum_j p_{ij} n_{ijk}$ we see that $\Lambda = PN$. Consider $\Gamma$ and $i_0$ as fixed and $p = (p_{ij})$ as a parameterization of the stochastic context-free grammars with root $i_0$, i.e. a parameterization of the probability measures $P_{i_0} = P_{i_0,p}$ on $T$ conditionally on the root non-terminal being $i_0$. The parameter space is thus

$$\Theta = \Big\{ p = (p_{ij}) \;\Big|\; \sum_{j=1}^{m_i} p_{ij} = 1,\ i = 1,\dots,n,\ p_{ij} \geq 0 \Big\}. \tag{3}$$

Having observed a finite tree $T \in T_0$ with root $i_0$, let $c_{ij} = c_{ij}(T)$ denote the number of times that $\gamma_{ij}$ occurs in $T$. Then the log-likelihood is

$$l_T(p) = \sum_{i=1}^{n} \sum_{j=1}^{m_i} c_{ij} \log(p_{ij}), \tag{4}$$

which of course gives the usual maximum likelihood estimator $\hat p_{ij} = s_i^{-1} c_{ij}$, where $s_i = \sum_j c_{ij}$ for $i = 1,\dots,n$, provided $s_i > 0$. If $s_i = 0$ we can simply throw away non-terminal $i$ before continuing.
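With complete observation the estimator is just normalised counting. A minimal sketch, with illustrative names and counts read off one hypothetical finite tree for the grammar $S \to S\,T \mid b$, $T \to S \mid a$ used in the earlier sketches:

```python
from collections import Counter

def mle_from_counts(counts):
    """Saturated-model MLE p_hat_ij = c_ij / s_i from rule-usage counts.

    counts: dict mapping (non-terminal, rhs) -> number of times the production
    was used in the observed finite tree.  Non-terminals with s_i = 0 are
    simply dropped, as in the text.
    """
    totals = Counter()
    for (nt, _rhs), c in counts.items():
        totals[nt] += c
    return {(nt, rhs): c / totals[nt]
            for (nt, rhs), c in counts.items() if totals[nt] > 0}

# Counts read off one hypothetical finite tree for the grammar S -> S T | b, T -> S | a.
counts = {("S", ("S", "T")): 2, ("S", ("b",)): 2, ("T", ("S",)): 1, ("T", ("a",)): 1}
print(mle_from_counts(counts))
```

For these counts $\hat p$ assigns probability 1/2 to every production, and the corresponding $\hat\Lambda$ has spectral radius about 0.81, in line with Theorem 4 below.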

Defining

$$c_i = (c_{i1},\dots,c_{im_i}), \quad C = \mathrm{diag}(c_1,\dots,c_n), \quad\text{and}\quad S = \mathrm{diag}(s_1,\dots,s_n), \tag{5}$$

we find that $\hat P = S^{-1}C$ and therefore the estimated expectation matrix can be given the representation

$$\hat\Lambda = S^{-1}CN. \tag{6}$$

We show below that $\hat\Lambda$ has spectral radius $\leq 1$. Besides some matrix manipulations, the crucial observation, resembling Kirchhoff's law for electric current, is that the total number of times that non-terminal $k$ occurs as a child in $T$ equals the total number of times that non-terminal $k$ occurs as a parent in $T$ when disregarding the root non-terminal $i_0$, which is the child of nobody. This is because the leaves in $T$ are all terminals. Thus for $k = 1,\dots,n$

$$\delta_{i_0,k} + \sum_{i=1}^{n} \sum_{j=1}^{m_i} c_{ij} n_{ijk} = \sum_{j=1}^{m_k} c_{kj} = s_k. \tag{7}$$

If we define the vectors $c$ and $s$ by

$$c = (c_1,\dots,c_n) \quad\text{and}\quad s = (s_1,\dots,s_n), \tag{8}$$

the equations given by (7) can be written in matrix form as

$$\delta_{i_0} + cN = s, \tag{9}$$

with $\delta_{i_0} = (\delta_{i_0,1},\dots,\delta_{i_0,n})$.

Theorem 4 If $\hat\Lambda$ is given by (6) with $C$ and $S$ defined in (5), and if $c$ and $s$, as given by (8), fulfill equation (9), then $\rho(\hat\Lambda) \leq 1$.

PROOF. First, for two square matrices $A$ and $B$ it holds that $\rho(AB) = \rho(BA)$. The proof is elementary; see for instance Exercise I.3.7 in (Bhatia, 1997). Thus

$$\rho(\hat\Lambda) = \rho(S^{-1}CN) = \rho(CNS^{-1}).$$

Regarding the spectral radius of any matrix $A$ with nonnegative entries, it holds that

$$\rho(A) \leq \max_j \sum_i A_{ij}, \tag{10}$$

cf. Theorem 1.1.5 and Corollary 1.1 in (Seneta, 1981), covering the case when $A$ is irreducible. By decomposing a reducible matrix into irreducible blocks, it follows that (10) holds also if $A$ is reducible. For the matrix $CNS^{-1}$ one easily shows that the (row)vector of column sums equals

$$cNS^{-1} = (s - \delta_{i_0})S^{-1} = \mathbf{1} - \delta_{i_0}S^{-1}, \tag{11}$$

where the first equality follows from (9). Here $\mathbf{1}$ denotes the vector of all ones. From (10) we get that $\rho(\hat\Lambda) \leq 1$.

Remark 5 If $\hat\Lambda$ is irreducible then $\rho(\hat\Lambda) < 1$. This is due to the fact that for irreducible matrices the bound in (10) is strict unless all the column sums are equal, cf. Corollary 1.1 in (Seneta, 1981).
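Theorem 4 and Remark 5 are easy to check numerically on a concrete set of counts. The sketch below is illustrative; it uses the two-non-terminal grammar from the earlier sketches written with $E = \{1,2\}$, builds $N$, $C$ and $S$ from counts satisfying (9), and confirms that the column sums of $CNS^{-1}$ are at most one and that $\rho(\hat\Lambda) \leq 1$.

```python
import numpy as np

# The illustrative grammar S -> S T | b, T -> S | a written with E = {1, 2}:
# gamma_11 = (1, 2), gamma_12 = b, gamma_21 = 1, gamma_22 = a.  The rows of N
# hold the counts n_ijk of each non-terminal k in gamma_ij (rows 11, 12, 21, 22).
N = np.array([[1, 1],
              [0, 0],
              [1, 0],
              [0, 0]])

# Rule-usage counts from one finite tree with root non-terminal i0 = 1; they
# satisfy the child/parent identity (9): delta_{i0} + c N = s.
c = np.array([2, 2, 1, 1])                  # c_11, c_12, c_21, c_22
s = np.array([c[0] + c[1], c[2] + c[3]])    # s_1, s_2
delta = np.array([1, 0])                    # delta_{i0} with i0 = 1
assert np.array_equal(delta + c @ N, s)

# Block-diagonal C (2 x 4) and diagonal S (2 x 2) as in (5).
C = np.array([[2, 2, 0, 0],
              [0, 0, 1, 1]])
S = np.diag(s)

lam_hat = np.linalg.inv(S) @ C @ N                    # estimated expectation matrix (6)
col_sums = (C @ N @ np.linalg.inv(S)).sum(axis=0)     # the vector (11)
print("column sums of C N S^-1:", col_sums)           # all <= 1, strict in column i0
print("rho(Lambda_hat):", max(abs(np.linalg.eigvals(lam_hat))))
```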

3.2 With partial observation

If we only observe the tree $T$ partially, we can rely on the EM-algorithm for estimation of the parameters. Let $f : T \to F$ for some set $F$, with $t = f(T)$ denoting the partial observation. If $f$, like the leaf traversal map $f_{LT}$, is defined a priori on $T_0$ only, we can extend it by adjoining an extra point to $F$ and letting $f$ take that value on $T \setminus T_0$. For convenience we assume that the root non-terminal $i_0$ is known and fixed. As in the previous section, $\Gamma$ is the fixed finite set of finite sequences $\gamma_{ij}$, $i = 1,\dots,n$ and $j = 1,\dots,m_i$, and the parameter space is $\Theta$ as given by (3). We define

$$\Theta_t = \{ p \in \Theta \mid P_{i_0,p}(f(T) = t) > 0 \}.$$

For $p \in \Theta_t$ let

$$c_{ij}(p,t) := E_{i_0,p}(c_{ij}(T) \mid t)$$

denote the conditional expectation of the variables $c_{ij}$ given the partial observation $t \in F$ under the measure $P_{i_0,p}$ given by the parameter $p$. Choosing some initial parameter value $p_0 \in \Theta_t$, the EM-algorithm updates the parameters recursively as follows; given $p_n$:

(1) compute $c_{ij}(p_n,t)$;
(2) compute $p_{n+1}$ by maximizing the log-likelihood

$$l(p) = \sum_{i=1}^{n} \sum_{j=1}^{m_i} c_{ij}(p_n,t) \log(p_{ij}),$$

that is, $p_{ij,n+1} = s_i(p_n,t)^{-1} c_{ij}(p_n,t)$, where $s_i(p_n,t) = \sum_j c_{ij}(p_n,t)$ for $i = 1,\dots,n$.

It is a well known property of the EM-algorithm that the sequence $(p_n)_{n \geq 0}$ of parameters gives an increasing (marginal) likelihood (Dempster et al., 1977), (Lari & Young, 1990), and we see from above that the maximization part of the EM-algorithm can be carried out explicitly in each iteration of the algorithm. The only problem is the computation of $c_{ij}(p_n,t)$. Different algorithms exist depending on the transformation $f$ and the stochastic context-free grammar. For details the reader is referred to the literature, e.g. (Durbin et al., 1998), (Lari & Young, 1990), and (Baker, 1979). It should be mentioned that these algorithms can be quite computationally demanding.
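Since the M-step is explicit, an EM driver only needs the expected counts $c_{ij}(p_n,t)$. The sketch below treats that computation as a black box supplied by the caller (for $f = f_{LT}$ it would typically be the inside-outside algorithm, which is not reproduced here); the function name, argument conventions and stopping rule are my own illustrative choices.

```python
def em(p0, t, expected_counts, n_iter=100, tol=1e-8):
    """Generic EM iteration for a stochastic context-free grammar.

    p0              initial parameters: dict (non-terminal, rhs) -> probability
    t               the partial observation, e.g. a terminal sequence
    expected_counts callable (p, t) -> dict (non-terminal, rhs) -> E_p[c_ij | t],
                    e.g. supplied by an inside-outside implementation
    """
    p = dict(p0)
    for _ in range(n_iter):
        c = expected_counts(p, t)                     # E-step (grammar dependent)
        s = {}
        for (nt, _rhs), value in c.items():           # totals s_i(p_n, t)
            s[nt] = s.get(nt, 0.0) + value
        p_new = {key: value / s[key[0]] for key, value in c.items()}   # explicit M-step
        if max(abs(p_new[k] - p.get(k, 0.0)) for k in p_new) < tol:
            return p_new
        p = p_new
    return p
```

By Theorem 6 below, every iterate produced by such a loop, and hence any limit point, corresponds to a terminating grammar whenever $f^{-1}(t) \subseteq T_0$.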

With $\Lambda(p)$ the matrix of expectations under $P_{i_0,p}$, we have the following theorem.

Theorem 6 Suppose that $t \in F$ is such that $f^{-1}(t) \subseteq T_0$ and $p_0 \in \Theta_t$. Then the EM-algorithm produces a sequence $(p_n)_{n \geq 0}$ satisfying $p_n \in \Theta_t$, $\rho(\Lambda(p_n)) \leq 1$ for $n \geq 1$, and if $p_n \to p$ for $n \to \infty$ then $\rho(\Lambda(p)) \leq 1$.

PROOF. First, since the EM-algorithm increases the marginal likelihood, $p_n \in \Theta_t$ for all $n \geq 1$ if $p_0 \in \Theta_t$. Since $f^{-1}(t) \subseteq T_0$, the equality obtained in (9), i.e. $\delta_{i_0} + c(T)N = s(T)$, holds for all $T$ with $f(T) = t$. The linearity of conditional expectations then gives that for all $p \in \Theta_t$

$$\delta_{i_0} + c(p,t)N = s(p,t), \tag{12}$$

where $c(p,t)$ and $s(p,t)$ denote the collection of $c_{ij}(p,t)$ and $s_i(p,t)$ into vectors exactly as in the previous section. It follows from Theorem 4 that $\rho(\Lambda(p_n)) \leq 1$ for all $n \geq 1$ since (12) holds. Continuity of $\Lambda$ as a function of $p$, as well as continuity of the spectral radius map, shows that for an eventual limit $p$ also $\rho(\Lambda(p)) \leq 1$ holds.

Remark 7 Note that if $p_0$ is the maximum likelihood estimate, provided it exists and is unique, the sequence of parameters obtained by the EM-algorithm will be constantly equal to $p_0$, and consequently $\rho(\Lambda(p_0)) \leq 1$.

Example 8 Let $E = \{1,2\}$, $A = \{a,b\}$, $\gamma_{11} = (1,2)$, $\gamma_{12} = b$, $\gamma_{21} = 1$, $\gamma_{22} = a$, and let $P$ be defined by $P(1,(1,2)) = 1 - P(1,b) = p_1$ and $P(2,1) = 1 - P(2,a) = p_2$ with $0 < p_1, p_2 < 1$. Let the root non-terminal be 1. We consider $f = f_{LT}$, so that

$$f_{LT}(T) = \underbrace{b \cdots b}_{n_1} \underbrace{a \cdots a}_{m_1} \cdots \underbrace{b \cdots b}_{n_k} \underbrace{a \cdots a}_{m_k}$$

where $n_1 > 0$. Observe that any sequence starting with $b$ can occur. Observe also that the counts $c_{ij}$ satisfy the following equations:

$$c_{22} = c_{11} - c_{21}, \quad c_{12} = 1 + c_{21}, \quad c_{12} = n := \sum_{i=1}^{k} n_i, \quad\text{and}\quad c_{22} = m := \sum_{i=1}^{k} m_i.$$

Hence $c_{21} = n - 1$ and $c_{11} = m + n - 1$. In this example we see that $f_{LT}$ is in fact sufficient and that the maximum-likelihood estimates for $p_1$ and $p_2$ are given explicitly as

$$\hat p_1 = \frac{m + n - 1}{m + 2n - 1} \quad\text{and}\quad \hat p_2 = \frac{n - 1}{m + n - 1}.$$

The expectation matrix is

$$\Lambda(p) = \begin{pmatrix} p_1 & p_1 \\ p_2 & 0 \end{pmatrix},$$

and $\rho(\Lambda(p)) \leq 1$ if and only if $p_2 \leq p_1^{-1} - 1$. Theorem 4 gives that $\rho(\Lambda(\hat p)) \leq 1$, but this can also be verified directly as

$$\hat p_1^{-1} - 1 = \frac{n}{m + n - 1} > \frac{n - 1}{m + n - 1} = \hat p_2.$$
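The estimates of Example 8 can be reproduced from an observed sequence. A minimal sketch (illustrative function name; it assumes the sequence has the displayed form, starting with $b$) computes $\hat p_1$ and $\hat p_2$ from the counts $n$ and $m$ of $b$'s and $a$'s and checks the termination condition both as $\hat p_2 \leq \hat p_1^{-1} - 1$ and via the spectral radius of $\Lambda(\hat p)$.

```python
import numpy as np

def example8_mle(string):
    """MLE for the grammar of Example 8 from one observed terminal sequence.

    Uses c_11 = m + n - 1, c_12 = n, c_21 = n - 1, c_22 = m, where n and m are
    the total numbers of b's and a's in the sequence.
    """
    n = string.count("b")
    m = string.count("a")
    p1_hat = (m + n - 1) / (m + 2 * n - 1)
    p2_hat = (n - 1) / (m + n - 1)
    return p1_hat, p2_hat

p1, p2 = example8_mle("bbbaabba")        # an admissible sequence: it starts with b
lam = np.array([[p1, p1], [p2, 0.0]])    # expectation matrix Lambda(p_hat)
rho = max(abs(np.linalg.eigvals(lam)))
print("p1_hat =", p1, " p2_hat =", p2)
print("p2_hat <= 1/p1_hat - 1:", p2 <= 1 / p1 - 1, "  rho(Lambda) =", rho)
```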

4 On the size of the tree and the number of leaves

It would clearly be interesting to understand the distribution of the size of the tree produced by a terminating stochastic context-free grammar in more detail. For instance, the length of the sequence produced by the leaf traversal map $f_{LT}$ equals the number of leaves in the tree. One should expect a fundamental difference between the two cases where the embedded branching process is critical ($\rho = 1$) or sub-critical ($\rho < 1$).

Considering only non-terminals, let $H_i$ denote the offspring distribution for the branching process from non-terminal $i$, i.e.

$$H_i(m_1,\dots,m_n) = \sum_{j \,:\, n_{ij1} = m_1,\dots,n_{ijn} = m_n} p_{ij},$$

and let $h_i$ denote the corresponding generating function, i.e.

$$h_i(z_1,\dots,z_n) = \sum_{m_1,\dots,m_n} z_1^{m_1} \cdots z_n^{m_n} H_i(m_1,\dots,m_n).$$

One can easily prove that the distribution $R_i$ of the total number of non-terminals produced, given that we start with one non-terminal $i$, has generating function $r_i$ fulfilling the equation

$$r_i(z) = z_i h_i(r_1(z),\dots,r_n(z))$$

with $z = (z_1,\dots,z_n)$. Otter (1949) used this equation for $n = 1$ to show that for $m \to \infty$

$$R_1(m) = c\,\alpha^m m^{-3/2} + O(\alpha^m m^{-5/2}), \qquad m \equiv 1 \pmod q, \tag{13}$$

for some constants $c$, $\alpha \leq 1$ and $q$ the period of $H_1$. The critical case is equivalent to $\alpha = 1$, so the tail of $R_1$ has a power-law decay with exponent $3/2$ in the critical case, whereas it has an exponentially light tail in the sub-critical case.

A similar result for e.g. the total number of individuals, or the total number of individuals of a given type, produced by a critical or sub-critical multitype branching process doesn't seem to exist. There is, however, a result due to Good (1960) on how in principle to compute $R_i$. In Section 3, Examples and applications, Good (1960) shows that $R_i(m_1,\dots,m_n)$ is the coefficient of $z_1^{m_1} \cdots z_i^{m_i-1} \cdots z_n^{m_n}$ in

$$h_1^{m_1} \cdots h_n^{m_n} \left\| \, I - \left\{ \frac{z_i}{h_i} \frac{\partial h_i}{\partial z_j} \right\}_{i,j} \right\| \tag{14}$$

with $\|\cdot\|$ denoting determinant.
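Formula (14) can be evaluated symbolically for small grammars. The sketch below is illustrative; it uses sympy, hard-codes the two offspring generating functions of the grammar from Example 8, extracts the required coefficient, and for $(m_1, m_2) = (3, 2)$ reproduces the value $p_1^2(1-p_1)(1-p_2)^2$ predicted by the closed form given in Example 9 below.

```python
import sympy as sp

z1, z2, p1, p2 = sp.symbols("z1 z2 p1 p2", positive=True)
q1, q2 = 1 - p1, 1 - p2

# Offspring generating functions for Example 8: non-terminal 1 uses 1 -> (1,2) or b,
# non-terminal 2 uses 2 -> 1 or a.
h = [q1 + p1 * z1 * z2, q2 + p2 * z1]
z = [z1, z2]

def progeny_probability(i, m1, m2):
    """R_i(m1, m2) via the coefficient formula (14); the index i is 0-based here."""
    M = sp.Matrix(2, 2, lambda a, b: z[a] / h[a] * sp.diff(h[a], z[b]))
    expr = sp.expand(sp.cancel(h[0] ** m1 * h[1] ** m2 * (sp.eye(2) - M).det()))
    powers = [m1, m2]
    powers[i] -= 1                              # coefficient of z_i^{m_i - 1}
    coeff = expr.coeff(z1, powers[0]).coeff(z2, powers[1])
    return sp.factor(coeff)

val = progeny_probability(0, 3, 2)
print(val)                                      # p1**2*(1 - p1)*(1 - p2)**2, up to sign arrangement
assert sp.expand(val - p1**2 * (1 - p1) * (1 - p2)**2) == 0
```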

[Figure 1 about here: the two log-marginals, $\log R(m_1,\cdot)$ against $\log(m_1)$ and $\log R(\cdot,m_2)$ against $\log(m_2)$.]

Fig. 1. The log-marginals plotted against $\log(m)$. In both cases we see a tail decay that is asymptotically linear on this log-log plot, thus the decay is like a power function. The straight line has slope $-3/2$.

Example 9 (Example 8 continued) The offspring generating functions $h_1$ and $h_2$ for the grammar from Example 8 are

$$h_1(z_1,z_2) = q_1 + p_1 z_1 z_2 \quad\text{and}\quad h_2(z_1,z_2) = q_2 + p_2 z_1$$

with $q_i = 1 - p_i$. Using (14), this gives

$$R_1(m_1,m_2) = \frac{1}{m_1} \binom{m_1}{m_2} \binom{m_2}{2m_2 - m_1 + 1} p_1^{m_2} p_2^{m_1 - m_2 - 1} q_1^{m_1 - m_2} q_2^{2m_2 - m_1 + 1}$$

for $m_2 + 1 \leq m_1 \leq 2m_2 + 1$. It still seems complicated to compute the tail behavior of the two marginals analytically, but we can investigate the distribution numerically. If we consider the case $p_1 = 3/4$ and $p_2 = 1/3$ (which is a critical example), Figure 1 shows a log-log plot of the marginal point probabilities of $R_1$ compared to a straight line with slope $-3/2$. On both graphs we see that, asymptotically, the decay of the marginals of $R_1$ is like $m^{-3/2}$. Moreover, for this example the length of the sequence $f_{LT}(T)$ equals $c_{11} + 1$, cf. Example 8, which in turn equals the total number of occurrences of non-terminal 2 in the tree plus 1. Thus in this critical example, the tail of the distribution of the length of $f_{LT}(T)$ decays as a power-law with exponent $3/2$.

We may suggest that this is a general phenomenon: that for critical, terminating stochastic context-free grammars, the distribution of the length of $f_{LT}(T)$ has a power-law decay with exponent $3/2$. To obtain such a result we will have to deal with the tail behavior of the distributions $R_i$ for multitype branching processes. The method used by Otter (1949) does not seem to generalize, as it relies heavily on the theory of complex functions of one variable. Dealing rigorously with this aspect of multitype branching processes seems to be an open problem.

References

Baker, James K. (1979), Trainable grammars for speech recognition, in: Klatt, D. H. and Wolf, J. J., eds., Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, pp. 547-550.

Bhatia, Rajendra (1997), Matrix Analysis (Springer-Verlag).

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977), Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. Ser. B 39, 1-38.

Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998), Biological Sequence Analysis. Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press).

Eddy, Sean R. and Durbin, Richard (1994), RNA sequence analysis using covariance models, Nucleic Acids Research 22, 11, 2079-2088.

Good, I. J. (1960), Generalizations to several variables of Lagrange's expansion, with applications to stochastic processes, Proc. Cambridge Philos. Soc. 56, 367-380.

Harris, Theodore E. (1963), The Theory of Branching Processes (Springer-Verlag).

Jagers, Peter (1975), Branching Processes with Biological Applications (John Wiley and Sons).

Lari, K. and Young, S. J. (1990), The estimation of stochastic context-free grammars using the Inside-Outside algorithm, Computer Speech and Language 4, 35-56.

Mode, Charles J. (1971), Multitype Branching Processes (Elsevier).

Otter, Richard (1949), The multiplicative process, Ann. Math. Statist. 20, 206-224.

Sakakibara, Yasubumi, et al. (1994), Stochastic context-free grammars for tRNA modeling, Nucleic Acids Research 22, 23, 5112-5120.

Sankoff, David (1971), Branching processes with terminal types: Application to context-free grammars, J. Appl. Prob. 8, 233-240.

Seneta, E. (1981), Non-negative Matrices and Markov Chains, Second edition (Springer-Verlag).
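The power-law behaviour reported for Figure 1 can be reproduced from the closed form for $R_1$ in Example 9. A small sketch (illustrative; it uses the critical choice $p_1 = 3/4$, $p_2 = 1/3$ from the text and works in log-space to avoid overflow) computes the marginal $\sum_{m_2} R_1(m_1, m_2)$ and rescales it by $m_1^{3/2}$; the rescaled values settle down to an approximately constant level, consistent with a tail decay like $m_1^{-3/2}$.

```python
from math import exp, lgamma, log

p1, p2 = 3 / 4, 1 / 3                    # the critical example: p2 = 1/p1 - 1
q1, q2 = 1 - p1, 1 - p2

def log_comb(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def R1(m1, m2):
    """Closed form from Example 9; zero outside m2 + 1 <= m1 <= 2*m2 + 1."""
    if not (m2 + 1 <= m1 <= 2 * m2 + 1):
        return 0.0
    logval = (log_comb(m1, m2) + log_comb(m2, 2 * m2 - m1 + 1) - log(m1)
              + m2 * log(p1) + (m1 - m2 - 1) * log(p2)
              + (m1 - m2) * log(q1) + (2 * m2 - m1 + 1) * log(q2))
    return exp(logval)

for m1 in (10, 100, 1000, 10000):
    marginal = sum(R1(m1, m2) for m2 in range((m1 - 1) // 2, m1))
    print(m1, marginal, marginal * m1 ** 1.5)    # last column approximately constant
```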