An Inequality for Rational Functions with Applications to Some Statistical Estimation Problems

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 37, NO. 1, JANUARY 1991

P. S. Gopalakrishnan, Member, IEEE, Dimitri Kanevsky, Member, IEEE, Arthur Nadas, and David Nahamoo, Member, IEEE

Manuscript received September 1988. Part of this material was presented at the 1989 International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, May 1989. The authors are with the IBM Research Division, T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598.

Abstract - The well-known Baum-Eagon inequality [3] provides an effective iterative scheme for finding a local maximum for homogeneous polynomials with positive coefficients over a domain of probability values. However, in many applications we are interested in maximizing a general rational function. We extend the Baum-Eagon inequality to rational functions. We briefly describe some of the applications of this inequality to statistical estimation problems.

Index Terms - Nonlinear optimization, statistical estimation, hidden Markov models, speech recognition.

I. INTRODUCTION

We design an algorithm in this paper for the maximization of rational functions over linear domains. The algorithm we develop is an extension of the well-known inequality derived by Baum and Eagon (see [3]) for an arbitrary homogeneous polynomial P(X) = P({X_{ij}}) with nonnegative coefficients of degree d in the variables X_{ij}, i = 1,...,p, j = 1,...,q_i. Assuming that this polynomial is defined over a domain of probability values, D: x_{ij} >= 0, \sum_{j=1}^{q_i} x_{ij} = 1, they show how to construct a transformation T: U -> D for some U contained in D such that the following property holds.

Property A: For any x in U and y = T(x), P(y) > P(x) unless y = x.

Polynomials of this type appear in various statistical problems dealing with the estimation of probabilistic functions of Markov processes via the maximum likelihood technique, and this inequality provides an effective iterative scheme for finding a local maximum (or near-maximum). A general paradigm used for maximization is the E-M (Expectation-Maximization) algorithm (see [5], [6]). In certain statistical problems it was found that estimating parameters via other criteria that use conditional likelihood, mutual information [2], or the recently introduced H-criterion [7] can give empirically better results than estimation via maximum likelihood. These problems require finding local maxima of rational functions over domains of probability values, and an analog of the Baum-Eagon inequality for rational functions enables us to use an iterative E-M-like algorithm for maximizing these functions. We describe just such an extension in this paper.

Given the domain D defined previously and a rational function R(X) = S_1(X)/S_2(X), where S_1(X), S_2(X) are polynomials with real coefficients in the variables X = {X_{ij}} and S_2(X) takes only positive values on D, we provide a way of constructing a large class of transformations T: D -> D such that the analog of Property A (with R(X) in place of P(X)) holds.

Property B: For any x in D and y = T(x), R(y) > R(x) unless y = x.

If a transformation T: D -> D satisfies Property B we say that T is a growth transformation of D for R(X). In this paper we establish relations between growth transformations for polynomials and for rational functions, thereby providing a method for constructing growth transformations for rational functions. The rest of this paper is organized as follows.
In Section II we show that growth transformations for a rational function R(X) can be obtained from growth transformations for a family of homogeneous polynomials with nonnegative coefficients attached in a certain way to the rational function R(X) and the domain D. In Section III we reproduce the Baum and Eagon result for polynomials. Given a rational function, our method provides not one but a whole family of growth transformations; some of these are described in Section IV, together with a simple example illustrating our algorithm. In Section V we discuss some applications of our approach to statistical estimation problems. Concluding remarks are presented in Section VI.

II. A REDUCTION OF THE CASE OF RATIONAL FUNCTIONS TO POLYNOMIALS

In this section we show how to reduce the problem of finding growth transformations for rational functions to the corresponding problem for polynomials. Recall that R(X) = S_1(X)/S_2(X) is a ratio of two polynomials S_1(X), S_2(X) in the variables X = {X_{ij}}, i = 1,...,p, j = 1,...,q_i, defined over the domain

    D: x_{ij} >= 0,  \sum_{j=1}^{q_i} x_{ij} = 1.    (1)

For simplicity we assume throughout this paper that S_2(x) > 0 for any x in D. All statements in this paper can easily be modified so that they hold for more general rational functions; this generalization will appear elsewhere. We are looking for a (growth) transformation T: D -> D such that Property B holds.

Our reduction of this problem to an instance that involves only homogeneous polynomials with nonnegative coefficients proceeds along the following steps. First, we reduce the problem of finding a growth transformation for a rational function to that of finding a growth transformation for a specially formed polynomial. We then show how to reduce this problem to one of finding growth transformations for a nonhomogeneous polynomial with nonnegative coefficients. Finally, we show that the inequality derived by Baum and Eagon can easily be extended to nonhomogeneous polynomials with nonnegative coefficients. In what follows we describe these steps in greater detail.

Step 1) First we show that for any x in D there exists a polynomial P_x(X)¹ such that if P_x(y) > P_x(x) for y in D, then R(y) > R(x). For this it is enough to set

    P_x(X) = S_1(X) - R(x) S_2(X).

Indeed, it is easy to see that P_x(x) = 0, and therefore if P_x(y) > P_x(x) = 0 then R(y) > R(x). Now suppose that for each polynomial P_x(X), x in D, we could construct a growth transformation T_x of D such that P_y(T_y(y)) > P_y(y) for any y in D unless y = T_y(y). Then we could define a growth transformation T of D for R(X) as

    T(y) = T_y(y)    (2)

for any y in D (the fact that T is a growth transformation follows because R(T(y)) > R(y) whenever P_y(T_y(y)) > P_y(y)). Thus we can construct growth transformations for rational functions if we can do so for an arbitrary polynomial.

¹In order to avoid complex notation we do not explicitly indicate that P_x is associated with the rational function R. This will be clear from context.

Step 2) In this step, the most crucial for our work, we show that for any polynomial P_x(X) there exists a polynomial P'_x(X) with nonnegative coefficients such that any growth transformation of D for P'_x(X) is also a growth transformation for P_x(X). This follows easily from the following general facts.

Lemma 2.1: Let P(X) = P({X_{ij}}) be a polynomial with real coefficients in the variables X_{ij}, i = 1,...,p, j = 1,...,q_i. Let the domain D be x_{ij} >= 0, \sum_{j=1}^{q_i} x_{ij} = 1. Then:

(a) There exists a polynomial C(X) such that the polynomial P'(X) = P(X) + C(X) has only nonnegative coefficients and such that the value C(x) at any x in D is a constant (independent of x).

(b) The set of growth transformations of D for P(X) coincides with the set of growth transformations of D for P'(X).

Proof: (a) Let d be the degree of P(X) and a its minimal negative coefficient (a = 0 if no negative coefficient exists). Then the polynomial

    C(X) = -a ( \sum_{i=1}^{p} \sum_{j=1}^{q_i} X_{ij} + 1 )^d    (3)

is constant in D (its value in D is equal to -a(p+1)^d since \sum_{j} X_{ij} = 1). Now, every possible monomial in P(X) also occurs in C(X). Since a < 0 is the smallest negative coefficient in P(X), it is easily seen that the sum of the coefficients of corresponding monomials in P(X) and C(X) is nonnegative.

(b) Since P'(X) and P(X) differ only by a constant in D, it is clear that P'(y) > P'(x) for any y, x in D if and only if P(y) > P(x), proving the second statement of the lemma.
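As a quick numerical reading of Lemma 2.1 (our own sketch, not code from the paper; the toy polynomial and the use of sympy are illustrative assumptions), one can form the auxiliary polynomial C(X) of (3) and check both claims of the lemma:

```python
# A toy check of Lemma 2.1 on a single probability row x1 + x2 + x3 = 1.
# The polynomial P below is illustrative; it is not taken from the paper.
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
X = (x1, x2, x3)

P = x1**2 - 3*x1*x2 + x3**2              # a polynomial with a negative coefficient
d = sp.Poly(P, *X).total_degree()        # degree d of P
a = min(sp.Poly(P, *X).coeffs())         # minimal (most negative) coefficient, a = -3

# Auxiliary polynomial from the proof of part (a):
C = -a * (sum(X) + 1)**d
P_prime = sp.expand(P + C)

# (a) P' = P + C has only nonnegative coefficients.
assert all(c >= 0 for c in sp.Poly(P_prime, *X).coeffs())

# C is constant on D = {x_i >= 0, x1 + x2 + x3 = 1}: its value is -a*(1 + 1)**d
# at every point of D, e.g. at two different points of the simplex.
p1 = {x1: sp.Rational(1, 3), x2: sp.Rational(1, 3), x3: sp.Rational(1, 3)}
p2 = {x1: sp.Rational(1, 2), x2: sp.Rational(1, 2), x3: 0}
assert C.subs(p1) == C.subs(p2) == -a * 2**d
```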
Step 3) In order to be able to apply the Baum-Eagon result, it remains to show that the problem of finding a growth transformation for a polynomial with nonnegative coefficients can be reduced to the same problem for a homogeneous polynomial with nonnegative coefficients. For this we proceed as follows. Let P'(X) be as in Step 2), let d be the degree of P'(X), and consider the homogeneous polynomial

    P''(Y) = P''({Y_{lm}}) = Y_{p+1,1}^d \, P'({Y_{ij} / Y_{p+1,1}})    (4)

in the variables Y_{lm}, l = 1,...,p+1, m = 1,...,q_l, where q_{p+1} = 1. In other words, P''(Y) is obtained from P'(X) by substituting X_{ij} = Y_{ij}/Y_{p+1,1} and multiplying P'({Y_{ij}/Y_{p+1,1}}) by Y_{p+1,1}^d. Since the new polynomial P''(Y) has one more variable than P', growth transformations for P'' are considered on the domain

    D': y_{ij} >= 0,  \sum_{j=1}^{q_i} y_{ij} = 1,  i = 1,...,p+1,  j = 1,...,q_i.    (5)

The additional variable Y_{p+1,1} is equal to 1 in D'. Therefore, the pair (P''(Y), D') is in fact an equivalent representation of the pair (P'(X), D), in the sense that there is a bijection f: D -> D' mapping x = {x_{ij}} in D into x' = {y_{ij}} such that x_{ij} = y_{ij} for (i, j) != (p+1, 1) and such that P'(x) = P''(f(x)) for any x in D. Thus, if T is a growth transformation for P'', then the composition of maps f^{-1} T f is a growth transformation for P'. This completes the third step.

Summarizing the three steps, we see that for a rational function R(X) over the domain D one can construct a family of homogeneous polynomials with nonnegative coefficients, parameterized by the points x in D, such that a growth transformation of D for R(X) can be reconstructed from a family of growth transformations for these polynomials. To give explicit formulas for growth transformations for rational functions, we require the inequality that was derived by Baum and Eagon. We review their result in the next section.

III. AN INEQUALITY FOR A POLYNOMIAL

In this section we state one theorem from [3] and give its extension in the corollary.

Theorem 3.1 (L. E. Baum, J. A. Eagon): Let P(X) = P({X_{ij}}) be a polynomial with nonnegative coefficients, homogeneous of degree d in its variables X_{ij}. Let x = {x_{ij}} be any point of the domain D: x_{ij} >= 0, \sum_{j=1}^{q_i} x_{ij} = 1, i = 1,...,p, j = 1,...,q_i, such that \sum_{j=1}^{q_i} x_{ij} (\partial P/\partial x_{ij})(x) > 0 for every i, where (\partial P/\partial x_{ij})(x) denotes the value of \partial P/\partial x_{ij} at x. Let y = T(x) = T({x_{ij}}) denote the point of D whose (i, j) coordinate is

    T(x)_{ij} = x_{ij} (\partial P/\partial x_{ij})(x) / \sum_{j=1}^{q_i} x_{ij} (\partial P/\partial x_{ij})(x).    (6)

Then P(T(x)) > P(x) unless T(x) = x.

The proof of this theorem given in [3] essentially used the homogeneity of the polynomial. In order to extend the theorem to nonhomogeneous polynomials one can use the construction from Section II. More precisely, let P be as in Theorem 3.1 but possibly nonhomogeneous, let P''(Y) be the homogeneous polynomial associated with P as in (4), and let Q(Y) be obtained from P''(Y) by adding a polynomial with nonnegative coefficients that is identically equal to 1 on D'. Let f: D -> D' be the map that was defined in Section II and let x' = f(x). Theorem 3.1 is then applicable to Q over D'. Using the facts that (\partial Q/\partial X_{ij})(x') = (\partial P/\partial X_{ij})(x), that Q differs from P'' in D' only by a constant (= 1), and that (P'', D') is the equivalent representation of (P, D) in the sense of Step 3) of Section II, we have the following corollary.

Corollary 3.2: Theorem 3.1 remains true for nonhomogeneous polynomials with nonnegative coefficients.

A similar extension of Theorem 3.1 can be found in [4].

Remark: Adding to P a suitable auxiliary polynomial (say (\sum_{i=1}^{p} \sum_{j=1}^{q_i} X_{ij})^n) that is constant in D, one can obtain a new polynomial with the property that the denominator in (6) is nonzero for all x in D, so that the transformation corresponding to this polynomial is correctly defined for all x in D. All polynomials derived from rational functions in the remainder of this paper will, in fact, satisfy this property.

IV. GROWTH TRANSFORMATIONS FOR RATIONAL FUNCTIONS

In this section we keep the notation of Section II and in addition give some new definitions. Given a rational function R(X) and a real number C, let

    \Gamma_{ij}(x; C) = x_{ij} ( (\partial R/\partial X_{ij})(x) + C )    (12)

and

    \Gamma_i(x; C) = \sum_{j=1}^{q_i} \Gamma_{ij}(x; C).    (13)

We call C admissible (for R(X)) if for any i, j, and x in D, \Gamma_{ij}(x; C) >= 0 and \Gamma_i(x; C) > 0. For any admissible C we consider the transformation T^C of D whose (i, j) coordinate is defined as

    T^C(x)_{ij} = \Gamma_{ij}(x; C) / \Gamma_i(x; C).    (14)

(It is easy to see that this definition is correct, i.e., T^C maps D into itself.) Now we can describe some growth transformations for R(X).

Theorem 4.1: Let R(X) be a rational function in the variables X_{ij}. There exists a constant N_R such that any C >= N_R is admissible for R(X), and for any such C the map T^C is a growth transformation of D for R(X).

Proof: In the proof of this theorem we follow the steps of Section II. From Step 2) it follows that for any x in D there exists a constant N_x such that the polynomial P_x(X) + C_x(X) has only nonnegative coefficients, where

    C_x(X) = N_x ( \sum_{i=1}^{p} \sum_{j=1}^{q_i} X_{ij} + 1 )^d    (15)

and d is the degree of P_x. Since by assumption R(X) has no poles in the compact set D, one can find N >= N_x for all x in D such that for any C >= N, multiplied by a factor that will be derived below, the transformation T_x^C of D constructed from the polynomial P_x(X) + C_x(X) via (6) is a growth transformation for it. For such a C, a family of growth transformations for R(X) can be constructed following (2):

    T^C(x) = T_x^C(x)    (16)

for all x in D. In order to compute these growth transformations explicitly, note that the polynomial

    \partial/\partial X_{ij} ( \sum_{i=1}^{p} \sum_{j=1}^{q_i} X_{ij} + 1 )^d = d ( \sum_{i=1}^{p} \sum_{j=1}^{q_i} X_{ij} + 1 )^{d-1}    (17)

is equal to d(p+1)^{d-1} at any point of D. Combining this remark with formulas (6) and (16), we have

    T^C(x)_{ij} = x_{ij} ( (\partial R/\partial X_{ij})(x) + C ) / \sum_{j=1}^{q_i} x_{ij} ( (\partial R/\partial X_{ij})(x) + C )    (18)

where C >= N_R = N d (p+1)^{d-1} and N = max_x N_x. This completes the proof of the theorem.
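As a concrete reading of the update (18), the following sketch applies one step of T^C to a point of a product of probability rows. It is our own illustration: the toy function R, its hand-coded gradient, and the constant C = 2.0 (standing in for an admissible constant as in Theorem 4.1) are assumptions, not taken from the paper.

```python
# One step of the growth transformation (18) for a rational function R over a
# product of probability rows. Our own sketch; the toy R, its gradient, and the
# constant C used below are illustrative assumptions.
import numpy as np

def growth_step(rows, grad_R, C):
    """rows: list of 1-D arrays, each nonnegative and summing to 1.
    grad_R(rows): partial derivatives dR/dx_ij in the same layout.
    Returns x_ij (dR/dx_ij + C) / sum_j x_ij (dR/dx_ij + C), as in (18)."""
    out = []
    for xi, gi in zip(rows, grad_R(rows)):
        num = xi * (gi + C)              # Gamma_ij(x; C) of (12)
        out.append(num / num.sum())      # row-wise normalization of (14)/(18)
    return out

# Toy rational function with two probability rows a and b:
#   R(a, b) = a1*b1 / (a1*b1 + a2*b2).
def grad_R(rows):
    a, b = rows
    den = a[0] * b[0] + a[1] * b[1]
    da = np.array([b[0] * a[1] * b[1], -a[0] * b[0] * b[1]]) / den**2
    db = np.array([a[0] * a[1] * b[1], -a[0] * b[0] * a[1]]) / den**2
    return [da, db]

rows = [np.array([0.4, 0.6]), np.array([0.5, 0.5])]
for _ in range(20):
    rows = growth_step(rows, grad_R, C=2.0)   # C = 2.0 happens to be admissible here
print(rows)   # both rows approach (1, 0), driving R toward its maximum value 1
```

The same routine, with the scalar C replaced by a row-dependent constant C_i, or with the derivative of P_x + C_x supplied in place of the shifted gradient, gives the transformations of Theorems 4.2 and 4.3 below.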

The class of growth transformations T^C considered in Theorem 4.1 was derived from a certain family of polynomials C_x(X). Considering other families of polynomials, one can derive new growth transformations for R(X). The following extension of Theorem 4.1 illustrates this.

Let \vec{C} = (C_1,...,C_p) be a vector in p-dimensional real space and let

    \Gamma_{ij}(x; \vec{C}) = x_{ij} ( (\partial R/\partial X_{ij})(x) + C_i )    (19)

    \Gamma_i(x; \vec{C}) = \sum_{j=1}^{q_i} \Gamma_{ij}(x; \vec{C}).    (20)

As before, we call a vector \vec{C} admissible for R(X) if for any i, j, and x in D, \Gamma_{ij}(x; \vec{C}) >= 0 and \Gamma_i(x; \vec{C}) > 0. For any admissible vector \vec{C} we consider the transformation of D defined as

    T^{\vec{C}}(x)_{ij} = \Gamma_{ij}(x; \vec{C}) / \Gamma_i(x; \vec{C}).    (21)

In this notation we have the following result.

Theorem 4.2: Let R(X) be a rational function in the variables X_{ij}. There exists a constant N_R such that any p-dimensional vector \vec{C} with all its components C_i >= N_R is admissible for R(X), and for any such \vec{C} the map T^{\vec{C}} is a growth transformation of D for R(X).

Proof: In fact the constant N_R in the formulation of this theorem can be taken to be the same as the one in the proof of Theorem 4.1. It is easy to see that if we take the admissible polynomials that were considered in the proof of Theorem 4.1 and add to them a suitable polynomial that is constant in D, then the new family of polynomials is admissible for R(X). Repeating computations similar to (18) we arrive at formula (21). This proves Theorem 4.2.

It also makes sense to give a general description of the growth transformations attached to arbitrary polynomials of the kind considered in Lemma 2.1. Let \mathcal{C} = {C_x(X), x in D} be a family of polynomials in the variables X_{ij}. Let

    \Gamma_{ij}(x; \mathcal{C}) = x_{ij} \, \partial(P_x + C_x)/\partial X_{ij} (x)    (23)

and

    \Gamma_i(x; \mathcal{C}) = \sum_{j=1}^{q_i} \Gamma_{ij}(x; \mathcal{C}).    (24)

We say that the family of polynomials \mathcal{C} is admissible for R(X) if for any i, j and x in D, \Gamma_{ij}(x; \mathcal{C}) >= 0 and \Gamma_i(x; \mathcal{C}) > 0. These admissible polynomial families give rise to the following set of transformations of D:

    T(x)_{ij} = \Gamma_{ij}(x; \mathcal{C}) / \Gamma_i(x; \mathcal{C}).    (25)

Under this notation, we have the following theorem.

Theorem 4.3: Let R(X) be a rational function in the variables X_{ij}. Let \mathcal{C} = {C_x(X), x in D} be an admissible family of polynomials for R(X) such that for any x in D, P_x(X) + C_x(X) has nonnegative coefficients and the polynomial C_x(X) is constant in D. Then T defined by (25) is a growth transformation of D for R(X).

Proof: Similar to the proofs of the previous two theorems.

Example: To illustrate our algorithm, consider a simple example. Suppose we are interested in obtaining a local maximum of the function R(x, y, z) = x^2/(x^2 + y^2 + z^2), where the variables x, y, z satisfy the constraints x, y, z >= 0, x + y + z = 1. We obtain the maximization algorithm by following our method.

Step 1) Start from some x_0, y_0, z_0 such that x_0 > 0, y_0 > 0, z_0 > 0, x_0 + y_0 + z_0 = 1. Set the iteration index i to 0.

Step 2) Set k = R(x_i, y_i, z_i).

Step 3) (Obtain a polynomial with nonnegative coefficients)

    P(x, y, z) = x^2 - k(x^2 + y^2 + z^2) + k(x + y + z)^2.    (26)

Note that the last term is constant over the domain x, y, z >= 0, x + y + z = 1. This polynomial is already homogeneous, so the update formulas of (6) can be applied.

Step 4) Use the update formulas from (6). Since

    x \partial P/\partial x + y \partial P/\partial y + z \partial P/\partial z = 2x^2 + 4kxy + 4kxz + 4kyz,

the updates are

    x_{i+1} = (x^2 + kxy + kxz) / (x^2 + 2kxy + 2kxz + 2kyz)    (27)

    y_{i+1} = (kxy + kyz) / (x^2 + 2kxy + 2kxz + 2kyz)    (28)

    z_{i+1} = (kxz + kyz) / (x^2 + 2kxy + 2kxz + 2kyz)    (29)

with the right-hand sides evaluated at (x_i, y_i, z_i).

Step 5) Increment the iteration index i by 1 and go to Step 2).

Plots of R(x, y, z) and of x, y, z over 9 iterations of this algorithm, starting from (x, y, z) = (0.3, 0.3, 0.4), are given in Fig. 1, and in Figs. 2(a)-2(d) we give plots of R(x, y, z) over 9 iterations for four different starting points (x, y, z). As can be seen from these graphs, the value of the function improves rapidly.

Fig. 1. R, x, y, and z versus iteration index, starting from (0.3, 0.3, 0.4).
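The example can be run directly; the short script below is our own transcription of Steps 1)-5), using the update formulas (27)-(29), and prints the kind of trajectory summarized in Fig. 1.

```python
# Iterating the updates (27)-(29) for R(x, y, z) = x**2 / (x**2 + y**2 + z**2)
# on the simplex x + y + z = 1, starting from (0.3, 0.3, 0.4) as in Fig. 1.
def R(x, y, z):
    return x**2 / (x**2 + y**2 + z**2)

x, y, z = 0.3, 0.3, 0.4
for i in range(9):
    k = R(x, y, z)                            # Step 2: k <- R(x_i, y_i, z_i)
    s = x**2 + 2*k*x*y + 2*k*x*z + 2*k*y*z    # common denominator of (27)-(29)
    x, y, z = ((x**2 + k*x*y + k*x*z) / s,    # (27)
               (k*x*y + k*y*z) / s,           # (28)
               (k*x*z + k*y*z) / s)           # (29)  (all use the old x, y, z)
    print(i + 1, round(R(x, y, z), 6), round(x, 4), round(y, 4), round(z, 4))
```

The value of R rises from about 0.26 at the starting point to about 0.75 after one iteration and above 0.95 within five, consistent with the rapid improvement visible in the figures.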
V. APPLICATIONS

The motivation for the work reported in this paper came from a study of the parameter estimation problem in speech recognition. Certain criteria used in estimating the parameters of hidden Markov models for words in speech recognition result in the maximization of rational functions over a domain of probability values, and a fast maximization algorithm for such functions is highly desirable. We discuss some details of this application next.

Let W_k, k = 1,...,V, be hidden Markov models for the words in the vocabulary. Assume for notational convenience that the states in all the models have unique numbers i, i = 1, 2,...,S. Typically, the word models are constructed by concatenating models for the phones (basic units of speech), as shown in Fig. 3 and Fig. 4. In the simplest case the outputs on the arcs of these models are acoustic labels. The acoustic signal is sampled and digitized, and feature vectors are extracted from it. These are vector quantized to produce a sequence of labels \vec{y} = (y_1,...,y_n). A hidden Markov model for a word² W_k is a parametric probability function giving the likelihood of a sequence of labels \vec{y}_k = (y_1,...,y_{n_k}) on the hypothesis that W_k was uttered. The parameters θ of the models W_k are the probabilities a = {a_{ij}}, where a_{ij} is the probability of taking the transition to state j from state i in some model (the so-called transition probabilities), and the probabilities b = {b_{ijk}}, where b_{ijk} is the probability of generating the output label k while taking the transition from state i to state j in some model (the so-called output probabilities).

Fig. 3. Hidden Markov model of a phone.

Fig. 4. Hidden Markov model for a word.

²For convenience, we refer to both the word and the associated HMM as W_k.
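For concreteness, the likelihood P_θ(\vec{y} | W_k) that such a model assigns to a label sequence can be computed with the standard forward recursion over the parameters a_{ij} and b_{ijk}. The sketch below is our own illustration; the tiny three-state, two-label model and its left-to-right topology are made up for the example and are not one of the paper's word models.

```python
# Forward computation of P_theta(y | W) for a discrete-output HMM with
# transition probabilities a[i, j] and output probabilities b[i, j, k]
# (probability of emitting label k on the transition i -> j).
import numpy as np

def forward_likelihood(a, b, labels, start_state=0, final_state=None):
    S = a.shape[0]
    final_state = S - 1 if final_state is None else final_state
    alpha = np.zeros(S)
    alpha[start_state] = 1.0                 # start in the initial state
    for k in labels:                         # one recursion step per observed label
        alpha = np.einsum('i,ij,ij->j', alpha, a, b[:, :, k])
    return alpha[final_state]                # probability of ending in the final state

# Tiny 3-state left-to-right model over a 2-symbol label alphabet (illustrative).
a = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
b = np.zeros((3, 3, 2))
b[0, 0] = [0.9, 0.1]; b[0, 1] = [0.8, 0.2]
b[1, 1] = [0.3, 0.7]; b[1, 2] = [0.2, 0.8]
b[2, 2] = [0.5, 0.5]
print(forward_likelihood(a, b, labels=[0, 1, 1]))   # P_theta(y | W) for y = (0, 1, 1)
```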

The typical training problem is to estimate the transition and output probabilities of the models given a sample of acoustics \vec{y} = y_1, y_2,...,y_n arising from the utterance of a sequence of words Y = W_{t_1}, W_{t_2},...,W_{t_m}, t_r in {1,...,V}. Maximum likelihood estimation of θ consists of choosing the parameters so as to maximize the sample likelihood

    P_θ(\vec{y} | Y) = \prod_{r=1}^{m} P_θ(\vec{y}_r | W_{t_r}).    (30)

This can be achieved via the E-M algorithm based on the Baum-Eagon inequality. However, when the goal of parameter estimation is the design of a good classifier, such as in a speech recognition system, it is sometimes helpful to have the estimation procedure maximize the value of some rational function of the sample likelihood P_θ(\vec{y} | Y) (see [2], [7]). For instance, we might desire to maximize the conditional likelihood P_θ(Y | \vec{y}), which for isolated word speech can be written as

    P_θ(Y | \vec{y}) = \prod_{r=1}^{m} [ P_θ(\vec{y}_r | W_{t_r}) P(W_{t_r}) / \sum_{k=1}^{V} P_θ(\vec{y}_r | W_k) P(W_k) ].    (31)

The likelihood functions P_θ(\vec{y}_r | W_{t_r}) are polynomials in the parameters θ = (a_{ij}, b_{ijk}), and hence the conditional likelihood is a rational function of the parameters. Other estimation criteria that give rise to rational objective functions are the maximum mutual information criterion [2] and its generalization, the H-criterion [7]. (For a discussion of the motivation for using such criteria see [2] or [7].) The algorithm described in this paper provides a fast method for estimating the parameters of the models using such criteria.

A. Implementation

We implemented this algorithm for estimating the parameters a_{ij}, b_{ijk} using the H-criterion [7]. The objective function being maximized was the H-criterion H_θ, which depends on a parameter h of the algorithm. In each iteration of the maximization algorithm we used a transformation based on (18): the updated parameters \hat{a}_{ij} and \hat{b}_{ijk} were obtained using

    \hat{a}_{ij} = a_{ij} ( (\partial \log H_θ/\partial a_{ij})(x) + C(x) ) / \sum_{j} a_{ij} ( (\partial \log H_θ/\partial a_{ij})(x) + C(x) )    (33)

    \hat{b}_{ijk} = b_{ijk} ( (\partial \log H_θ/\partial b_{ijk})(x) + D(x) ) / \sum_{k} b_{ijk} ( (\partial \log H_θ/\partial b_{ijk})(x) + D(x) ).    (34)

Here x is the point in the parameter domain defined by the values a_{ij} and b_{ijk} in the current iteration. The use of the partial derivatives of \log H_θ in these formulas can easily be seen to be correct. The update formulas in (18) require the partial derivatives \partial/\partial a_{ij}(N_θ - k D_θ), where N_θ denotes the numerator of H_θ, D_θ its denominator, and k the current value of H_θ. Clearly,

    \partial/\partial a_{ij}(N_θ - k D_θ) = N_θ [ (1/N_θ) \partial N_θ/\partial a_{ij} - (1/D_θ) \partial D_θ/\partial a_{ij} ] = N_θ \, \partial \log H_θ/\partial a_{ij}.    (35)

In the update formulas the common factor N_θ cancels out of the numerator and denominator, changing only the value of the constant C(x), which does not affect the correctness of the formulas. Similarly for the update formulas for b_{ijk}. In a practical implementation, the use of logs makes it easier to keep the values within the dynamic range of the computer.

Since faster convergence requires a small constant C in (18), and determination of such a constant is rather involved, we used an approximate version of this transformation. The values of C(x) and D(x) in each iteration were chosen to make all the derivatives, shifted by these constants, positive:

    C(x) = max { max_{i,j} ( -(\partial \log H_θ/\partial a_{ij})(x) ), 0 } + ε    (36)

    D(x) = max { max_{i,j,k} ( -(\partial \log H_θ/\partial b_{ijk})(x) ), 0 } + ε    (37)

where ε is a small positive constant. We used the forward-backward algorithm (see [1]) to compute the derivatives that go into this transformation.
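A schematic sketch of one such reestimation step for the transition probabilities, with C(x) chosen as in (36), is given below. The array shapes and the helper dlogH_da (standing for the derivatives produced by the forward-backward pass) are illustrative assumptions, not the actual implementation.

```python
# Sketch of one approximate reestimation step for the transition probabilities,
# following (33) with C(x) chosen as in (36). The derivative matrix dlogH_da is
# assumed to be supplied by a forward-backward computation that is not shown here.
import numpy as np

def reestimate_transitions(a, dlogH_da, eps=1e-3):
    """a: S x S matrix of transition probabilities, each row summing to 1.
    dlogH_da: S x S matrix of d(log H_theta)/d(a_ij) at the current parameters.
    Returns the updated transition matrix of (33)."""
    C = max(np.max(-dlogH_da), 0.0) + eps            # (36): make every term positive
    num = a * (dlogH_da + C)                         # a_ij (d log H / d a_ij + C(x))
    return num / num.sum(axis=1, keepdims=True)      # normalize each row, as in (33)
```

The output probabilities b_{ijk} are updated in the same way, using the constant D(x) of (37) and normalizing over the label index k instead of the destination state j.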
Notice that for sufficiently large values of C(x) and D(x) one can prove that the transformation given by (33) and (34) is a growth transformation for the H-criterion. However, for the choice of values in (36) and (37) we cannot prove this. In other words, the algorithm is not guaranteed to increase the value of the objective function in each iteration.

However, the values in (36) and (37) are easy to compute, and experiments show that the objective function increases very rapidly from one iteration to the next, illustrating the usefulness of this modified version of our algorithm in practical problems.

This algorithm was used to estimate the parameters of the hidden Markov models for the words in the twenty-thousand-word speech recognition system at IBM Research. A comparison of the percentage increase in the value of the objective function using this algorithm and a gradient hill climbing algorithm, for different values of the parameter h of the H-criterion, is shown in Table I. These results were obtained after six iterations of each algorithm. As is evident from the table, our algorithm succeeds in increasing the objective function by an order of magnitude or more above what is achieved by the gradient hill climbing algorithm for the same number of iterations. In fact, the gradient algorithm requires fifty or more iterations to obtain a comparable improvement in the objective function. Notice that the amount of computation required in each iteration of our algorithm is the same as that required by the gradient hill climbing algorithm that we implemented. In practice, too much computational power is required to run about fifty iterations of such an algorithm on practical data.

TABLE I
PERCENTAGE IMPROVEMENT IN OBJECTIVE FUNCTION*

    h        Gradient Hill Climbing        Our Algorithm
    -                  -%                      21.7%
    -                  -%                      22.4%
    -                  -%                      16.5%

*Obtained by six iterations of a gradient hill climbing algorithm and of the algorithm described in Section IV.

VI. CONCLUSION

We presented an algorithm for the maximization of certain rational functions defined over domains of probability values. This algorithm is an extension of a method presented first by Baum and Eagon [3] and by others [4]. Our algorithm finds application in several areas, including problems arising in statistics. The motivation for this work arose from studying the estimation of the parameters of a speech recognition system. The discussion presented in Section V shows that our algorithm is very useful in this practical situation, being effective even in an approximate form. It can be expected that the algorithm will prove effective in many other practical applications.

ACKNOWLEDGMENT

The authors thank the referees for several suggestions that improved the quality of the paper.

REFERENCES

[1] L. R. Bahl, F. Jelinek, and R. L. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, no. 2, pp. 179-190, Mar. 1983.
[2] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1986.
[3] L. E. Baum and J. A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology," Bull. Amer. Math. Soc., vol. 73, pp. 360-363, 1967.
[4] L. E. Baum and G. Sell, "Growth transformations for functions on manifolds," Pacific J. Math., vol. 27, no. 2, pp. 211-227, 1968.
[5] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Ann. Math. Statist., vol. 41, no. 1, pp. 164-171, 1970.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc. Ser. B, vol. 39, pp. 1-38, 1977.
[7] P. S. Gopalakrishnan, D. Kanevsky, A. Nadas, D. Nahamoo, and M. A. Picheny, "Decoder selection based on cross-entropies," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1988.
