An Inequality for Rational Functions with Applications to Some Statistical Estimation Problems

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 37, NO. 1, JANUARY 1991

P. S. Gopalakrishnan, Member, IEEE, Dimitri Kanevsky, Member, IEEE, Arthur Nadas, and David Nahamoo, Member, IEEE

Manuscript received September 1988. Part of this material was presented at the 1989 International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, May 1989. The authors are with the IBM Research Division, T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598.

Abstract - The well-known Baum-Eagon inequality [3] provides an effective iterative scheme for finding a local maximum for homogeneous polynomials with positive coefficients over a domain of probability values. However, in many applications we are interested in maximizing a general rational function. We extend the Baum-Eagon inequality to rational functions. We briefly describe some of the applications of this inequality to statistical estimation problems.

Index Terms - Nonlinear optimization, statistical estimation, hidden Markov models, speech recognition.

I. INTRODUCTION

We design an algorithm in this paper for the maximization of rational functions over linear domains. The algorithm we develop is an extension of the well-known inequality derived by Baum and Eagon (see [3]) for an arbitrary homogeneous polynomial P(X) = P({X_{ij}}) with nonnegative coefficients of degree d in the variables X_{ij}, i = 1,...,p, j = 1,...,q_i. Assuming that this polynomial is defined over a domain of probability values, D: x_{ij} >= 0, \sum_{j=1}^{q_i} x_{ij} = 1, they show how to construct a transformation T: U -> D for some U contained in D such that the following property holds.

Property A: For any x in U and y = T(x), P(y) > P(x) unless y = x.

Polynomials of this type appear in various statistical problems dealing with the estimation of probabilistic functions of Markov processes via the maximum likelihood technique, and this inequality provides an effective iterative scheme for finding a local maximum (or near-maximum). A general paradigm used for maximization is the E-M (Expectation-Maximization) algorithm (see [5], [6]). In certain statistical problems it was found that estimating parameters via other criteria that use conditional likelihood, mutual information [2], or the recently introduced H-criterion [7] can give empirically better results than estimation via maximum likelihood. These problems require finding local maxima of rational functions over domains of probability values, and an analog of the Baum-Eagon inequality for rational functions enables us to use an iterative E-M-like algorithm for maximizing these functions. We describe just such an extension in this paper.

Given the domain D defined previously and a rational function R(X) = S_1(X)/S_2(X), where S_1(X), S_2(X) are polynomials with real coefficients in the variables X = {X_{ij}} and S_2(X) takes only positive values on D, we provide a way of constructing a large class of transformations T: D -> D such that the analog of Property A (with R(X) in place of P(X)) holds.

Property B: For any x in D and y = T(x), R(y) > R(x) unless y = x.

If a transformation T: D -> D satisfies Property B we say that T is a growth transformation of D for R(X). In this paper we establish relations between growth transformations for polynomials and for rational functions, thereby providing a method for constructing growth transformations for rational functions. The rest of this paper is organized as follows.
In Section II we show that growth transformations for a rational function R(X) can be obtained from growth transformations for a family of homogeneous polynomials with nonnegative coefficients attached in a certain way to the rational function R(X) and the domain D. In Section III we reproduce the Baum and Eagon result for polynomials. Given a rational function, our method provides not one but a whole family of growth transformations; some of these are described in Section IV, together with a simple example illustrating our algorithm. In Section V we discuss some applications of our approach to statistical estimation problems. Concluding remarks are presented in Section VI.

II. A REDUCTION OF THE CASE OF RATIONAL FUNCTIONS TO POLYNOMIALS

In this section we show how to reduce the problem of finding growth transformations for rational functions to the corresponding problem for polynomials. Recall that R(X) = S_1(X)/S_2(X) is a ratio of two polynomials S_1(X), S_2(X) in the variables X = {X_{ij}}, i = 1,...,p, j = 1,...,q_i, defined over the domain

    D: x_{ij} >= 0,  \sum_{j=1}^{q_i} x_{ij} = 1.    (1)

For simplicity we assume throughout this paper that S_2(x) > 0 for any x in D. All statements in this paper can easily be modified so that they hold for more general rational functions; this generalization will appear elsewhere. We are looking for a (growth) transformation T: D -> D such that Property B holds.

Our reduction of this problem to an instance that involves only homogeneous polynomials with nonnegative coefficients proceeds along the following steps. First, we reduce the problem of finding a growth transformation for a rational function to that of finding a growth transformation for a specially formed polynomial. We then show how to reduce this problem to one of finding growth transformations for a nonhomogeneous polynomial with nonnegative coefficients. Finally, we show that the inequality derived by Baum and Eagon can easily be extended to nonhomogeneous polynomials with nonnegative coefficients. In what follows we describe these steps in greater detail.

Step 1) First we show that for any x in D there exists a polynomial P_x(X)¹ such that if P_x(y) > P_x(x) for y in D, then R(y) > R(x). For this it is enough to set

    P_x(X) = S_1(X) - R(x) S_2(X).

Indeed, it is easy to see that P_x(x) = 0, and therefore if P_x(y) > P_x(x) = 0 then R(y) > R(x). Now suppose that for each polynomial P_x(X), x in D, we could construct a growth transformation T_x of D such that P_y(T_y(y)) > P_y(y) for any y in D unless y = T_y(y). Then we could define a growth transformation T of D for R(X) as

    T(y) = T_y(y)    (2)

for any y in D (the fact that T is a growth transformation follows because R(T(y)) > R(y) whenever P_y(T_y(y)) > P_y(y)). Thus we can construct growth transformations for rational functions if we can do so for an arbitrary polynomial.

¹In order to avoid complex notation we do not explicitly indicate that P_x is associated with the rational function R. This will be clear from context.

Step 2) In this step, the most crucial for our work, we show that for any polynomial P_x(X) there exists a polynomial P'_x(X) with nonnegative coefficients such that any growth transformation of D for P'_x(X) is also a growth transformation for P_x(X). This follows easily from the following general facts.

Lemma 2.1: Let P(X) = P({X_{ij}}) be a polynomial with real coefficients in the variables X_{ij}, i = 1,...,p, j = 1,...,q_i. Let the domain D be x_{ij} >= 0, \sum_{j=1}^{q_i} x_{ij} = 1. Then:

(a) There exists a polynomial C(X) such that the polynomial P'(X) = P(X) + C(X) has only nonnegative coefficients and such that the value C(x) at any x in D is a constant (independent of x).

(b) The set of growth transformations of D for P(X) coincides with the set of growth transformations of D for P'(X).

Proof: (a) Let d be the degree of P(X) and a its minimal negative coefficient (a = 0 if no negative coefficient exists). Then the polynomial

    C(X) = -a ( \sum_{i=1}^{p} \sum_{j=1}^{q_i} X_{ij} + 1 )^d    (3)

is constant in D (its value in D is equal to -a(p+1)^d since \sum_{j} X_{ij} = 1). Now, every possible monomial in P(X) also occurs in C(X). Since a < 0 is the smallest negative coefficient in P(X), it is easily seen that the sum of the coefficients of corresponding monomials in P(X) and C(X) is nonnegative.

(b) Since P'(X) and P(X) differ only by a constant in D, it is clear that P'(y) > P'(x) for any y, x in D if and only if P(y) > P(x), proving the second statement of the lemma.
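As a quick numerical reading of Lemma 2.1 (our own sketch, not code from the paper; the toy polynomial and the use of sympy are illustrative assumptions), one can form the auxiliary polynomial C(X) of (3) and check both claims of the lemma:

```python
# A toy check of Lemma 2.1 on a single probability row x1 + x2 + x3 = 1.
# The polynomial P below is illustrative; it is not taken from the paper.
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
X = (x1, x2, x3)

P = x1**2 - 3*x1*x2 + x3**2              # a polynomial with a negative coefficient
d = sp.Poly(P, *X).total_degree()        # degree d of P
a = min(sp.Poly(P, *X).coeffs())         # minimal (most negative) coefficient, a = -3

# Auxiliary polynomial from the proof of part (a):
C = -a * (sum(X) + 1)**d
P_prime = sp.expand(P + C)

# (a) P' = P + C has only nonnegative coefficients.
assert all(c >= 0 for c in sp.Poly(P_prime, *X).coeffs())

# C is constant on D = {x_i >= 0, x1 + x2 + x3 = 1}: its value is -a*(1 + 1)**d
# at every point of D, e.g. at two different points of the simplex.
p1 = {x1: sp.Rational(1, 3), x2: sp.Rational(1, 3), x3: sp.Rational(1, 3)}
p2 = {x1: sp.Rational(1, 2), x2: sp.Rational(1, 2), x3: 0}
assert C.subs(p1) == C.subs(p2) == -a * 2**d
```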
Step 3) In order to be able to apply the Baum-Eagon result, it remains to show that the problem of finding a growth transformation for a polynomial with nonnegative coefficients can be reduced to the same problem for a homogeneous polynomial with nonnegative coefficients. For this we proceed as follows. Let P'(X) be as in Step 2), let d be the degree of P'(X), and consider the homogeneous polynomial

    P''(Y) = P''({Y_{lm}}) = Y_{p+1,1}^d \, P'({Y_{ij} / Y_{p+1,1}})    (4)

in the variables Y_{lm}, l = 1,...,p+1, m = 1,...,q_l, where q_{p+1} = 1. In other words, P''(Y) is obtained from P'(X) by substituting X_{ij} = Y_{ij}/Y_{p+1,1} and multiplying P'({Y_{ij}/Y_{p+1,1}}) by Y_{p+1,1}^d. Since the new polynomial P''(Y) has one more variable than P', growth transformations for P'' are considered on the domain

    D': y_{ij} >= 0,  \sum_{j=1}^{q_i} y_{ij} = 1,  i = 1,...,p+1,  j = 1,...,q_i.    (5)

The additional variable Y_{p+1,1} is equal to 1 in D'. Therefore, the pair (P''(Y), D') is in fact an equivalent representation of the pair (P'(X), D), in the sense that there is a bijection f: D -> D' mapping x = {x_{ij}} in D into x' = {y_{ij}} such that x_{ij} = y_{ij} for (i, j) != (p+1, 1) and such that P'(x) = P''(f(x)) for any x in D. Thus, if T is a growth transformation for P'', then the composition of maps f^{-1} T f is a growth transformation for P'. This completes the third step.

Summarizing the three steps, we see that for a rational function R(X) over the domain D one can construct a family of homogeneous polynomials with nonnegative coefficients, parameterized by the points x in D, such that a growth transformation of D for R(X) can be reconstructed from a family of growth transformations for these polynomials. To give explicit formulas for growth transformations for rational functions, we require the inequality that was derived by Baum and Eagon. We review their result in the next section.

III. AN INEQUALITY FOR A POLYNOMIAL

In this section we state one theorem from [3] and give its extension in the corollary.

Theorem 3.1 (L. E. Baum, J. A. Eagon): Let P(X) = P({X_{ij}}) be a polynomial with nonnegative coefficients, homogeneous of degree d in its variables X_{ij}. Let x = {x_{ij}} be any point of the domain D: x_{ij} >= 0, \sum_{j=1}^{q_i} x_{ij} = 1, i = 1,...,p, j = 1,...,q_i, such that \sum_{j=1}^{q_i} x_{ij} (\partial P/\partial x_{ij})(x) > 0 for every i, where (\partial P/\partial x_{ij})(x) denotes the value of \partial P/\partial x_{ij} at x. Let y = T(x) = T({x_{ij}}) denote the point of D whose (i, j) coordinate is

    T(x)_{ij} = x_{ij} (\partial P/\partial x_{ij})(x) / \sum_{j=1}^{q_i} x_{ij} (\partial P/\partial x_{ij})(x).    (6)

Then P(T(x)) > P(x) unless T(x) = x.

The proof of this theorem given in [3] essentially used the homogeneity of the polynomial. In order to extend the theorem to nonhomogeneous polynomials one can use the construction from Section II. More precisely, let P be as in Theorem 3.1 but possibly nonhomogeneous, let P''(Y) be the homogeneous polynomial associated with P as in (4), and let Q(Y) be obtained from P''(Y) by adding a polynomial with nonnegative coefficients that is identically equal to 1 on D'. Let f: D -> D' be the map that was defined in Section II and let x' = f(x). Theorem 3.1 is then applicable to Q over D'. Using the facts that (\partial Q/\partial X_{ij})(x') = (\partial P/\partial X_{ij})(x), that Q differs from P'' in D' only by a constant (= 1), and that (P'', D') is the equivalent representation of (P, D) in the sense of Step 3) of Section II, we have the following corollary.

Corollary 3.2: Theorem 3.1 remains true for nonhomogeneous polynomials with nonnegative coefficients.

A similar extension of Theorem 3.1 can be found in [4].

Remark: Adding to P a suitable auxiliary polynomial (say (\sum_{i=1}^{p} \sum_{j=1}^{q_i} X_{ij})^n) that is constant in D, one can obtain a new polynomial with the property that the denominator in (6) is nonzero for all x in D, so that the transformation corresponding to this polynomial is correctly defined for all x in D. All polynomials derived from rational functions in the remainder of this paper will, in fact, satisfy this property.

IV. GROWTH TRANSFORMATIONS FOR RATIONAL FUNCTIONS

In this section we keep the notation of Section II and in addition give some new definitions. Given a rational function R(X) and a real number C, let

    \Gamma_{ij}(x; C) = x_{ij} ( (\partial R/\partial X_{ij})(x) + C )    (12)

and

    \Gamma_i(x; C) = \sum_{j=1}^{q_i} \Gamma_{ij}(x; C).    (13)

We call C admissible (for R(X)) if for any i, j, and x in D, \Gamma_{ij}(x; C) >= 0 and \Gamma_i(x; C) > 0. For any admissible C we consider the transformation T^C of D whose (i, j) coordinate is defined as

    T^C(x)_{ij} = \Gamma_{ij}(x; C) / \Gamma_i(x; C).    (14)

(It is easy to see that this definition is correct, i.e., T^C maps D into itself.) Now we can describe some growth transformations for R(X).

Theorem 4.1: Let R(X) be a rational function in the variables X_{ij}. There exists a constant N_R such that any C >= N_R is admissible for R(X), and for any such C the map T^C is a growth transformation of D for R(X).

Proof: In the proof of this theorem we follow the steps of Section II. From Step 2) it follows that for any x in D there exists a constant N_x such that the polynomial P_x(X) + C_x(X) has only nonnegative coefficients, where

    C_x(X) = N_x ( \sum_{i=1}^{p} \sum_{j=1}^{q_i} X_{ij} + 1 )^d    (15)

and d is the degree of P_x. Since by assumption R(X) has no poles in the compact set D, one can find N >= N_x for all x in D such that for any C >= N, multiplied by a factor that will be derived below, the transformation T_x^C of D constructed from the polynomial P_x(X) + C_x(X) via (6) is a growth transformation for it. For such a C, a family of growth transformations for R(X) can be constructed following (2):

    T^C(x) = T_x^C(x)    (16)

for all x in D. In order to compute these growth transformations explicitly, note that the polynomial

    \partial/\partial X_{ij} ( \sum_{i=1}^{p} \sum_{j=1}^{q_i} X_{ij} + 1 )^d = d ( \sum_{i=1}^{p} \sum_{j=1}^{q_i} X_{ij} + 1 )^{d-1}    (17)

is equal to d(p+1)^{d-1} at any point of D. Combining this remark with formulas (6) and (16), we have

    T^C(x)_{ij} = x_{ij} ( (\partial R/\partial X_{ij})(x) + C ) / \sum_{j=1}^{q_i} x_{ij} ( (\partial R/\partial X_{ij})(x) + C )    (18)

where C >= N_R = N d (p+1)^{d-1} and N = max_x N_x. This completes the proof of the theorem.
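As a concrete reading of the update (18), the following sketch applies one step of T^C to a point of a product of probability rows. It is our own illustration: the toy function R, its hand-coded gradient, and the constant C = 2.0 (standing in for an admissible constant as in Theorem 4.1) are assumptions, not taken from the paper.

```python
# One step of the growth transformation (18) for a rational function R over a
# product of probability rows. Our own sketch; the toy R, its gradient, and the
# constant C used below are illustrative assumptions.
import numpy as np

def growth_step(rows, grad_R, C):
    """rows: list of 1-D arrays, each nonnegative and summing to 1.
    grad_R(rows): partial derivatives dR/dx_ij in the same layout.
    Returns x_ij (dR/dx_ij + C) / sum_j x_ij (dR/dx_ij + C), as in (18)."""
    out = []
    for xi, gi in zip(rows, grad_R(rows)):
        num = xi * (gi + C)              # Gamma_ij(x; C) of (12)
        out.append(num / num.sum())      # row-wise normalization of (14)/(18)
    return out

# Toy rational function with two probability rows a and b:
#   R(a, b) = a1*b1 / (a1*b1 + a2*b2).
def grad_R(rows):
    a, b = rows
    den = a[0] * b[0] + a[1] * b[1]
    da = np.array([b[0] * a[1] * b[1], -a[0] * b[0] * b[1]]) / den**2
    db = np.array([a[0] * a[1] * b[1], -a[0] * b[0] * a[1]]) / den**2
    return [da, db]

rows = [np.array([0.4, 0.6]), np.array([0.5, 0.5])]
for _ in range(20):
    rows = growth_step(rows, grad_R, C=2.0)   # C = 2.0 happens to be admissible here
print(rows)   # both rows approach (1, 0), driving R toward its maximum value 1
```

The same routine, with the scalar C replaced by a row-dependent constant C_i, or with the derivative of P_x + C_x supplied in place of the shifted gradient, gives the transformations of Theorems 4.2 and 4.3 below.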

The class of growth transformations T^C considered in Theorem 4.1 was derived from a certain family of polynomials C_x(X). Considering other families of polynomials, one can derive new growth transformations for R(X). The following extension of Theorem 4.1 illustrates this.

Let \vec{C} = (C_1,...,C_p) be a vector in p-dimensional real space and let

    \Gamma_{ij}(x; \vec{C}) = x_{ij} ( (\partial R/\partial X_{ij})(x) + C_i )    (19)

    \Gamma_i(x; \vec{C}) = \sum_{j=1}^{q_i} \Gamma_{ij}(x; \vec{C}).    (20)

As before, we call a vector \vec{C} admissible for R(X) if for any i, j, and x in D, \Gamma_{ij}(x; \vec{C}) >= 0 and \Gamma_i(x; \vec{C}) > 0. For any admissible vector \vec{C} we consider the transformation of D defined as

    T^{\vec{C}}(x)_{ij} = \Gamma_{ij}(x; \vec{C}) / \Gamma_i(x; \vec{C}).    (21)

In this notation we have the following result.

Theorem 4.2: Let R(X) be a rational function in the variables X_{ij}. There exists a constant N_R such that any p-dimensional vector \vec{C} with all its components C_i >= N_R is admissible for R(X), and for any such \vec{C} the map T^{\vec{C}} is a growth transformation of D for R(X).

Proof: In fact the constant N_R in the formulation of this theorem can be taken to be the same as the one in the proof of Theorem 4.1. It is easy to see that if we take the admissible polynomials that were considered in the proof of Theorem 4.1 and add to them a suitable polynomial that is constant in D, then the new family of polynomials is admissible for R(X). Repeating computations similar to (18) we arrive at formula (21). This proves Theorem 4.2.

It also makes sense to give a general description of the growth transformations attached to arbitrary polynomials of the kind considered in Lemma 2.1. Let \mathcal{C} = {C_x(X), x in D} be a family of polynomials in the variables X_{ij}. Let

    \Gamma_{ij}(x; \mathcal{C}) = x_{ij} \, \partial(P_x + C_x)/\partial X_{ij} (x)    (23)

and

    \Gamma_i(x; \mathcal{C}) = \sum_{j=1}^{q_i} \Gamma_{ij}(x; \mathcal{C}).    (24)

We say that the family of polynomials \mathcal{C} is admissible for R(X) if for any i, j and x in D, \Gamma_{ij}(x; \mathcal{C}) >= 0 and \Gamma_i(x; \mathcal{C}) > 0. These admissible polynomial families give rise to the following set of transformations of D:

    T(x)_{ij} = \Gamma_{ij}(x; \mathcal{C}) / \Gamma_i(x; \mathcal{C}).    (25)

Under this notation, we have the following theorem.

Theorem 4.3: Let R(X) be a rational function in the variables X_{ij}. Let \mathcal{C} = {C_x(X), x in D} be an admissible family of polynomials for R(X) such that for any x in D, P_x(X) + C_x(X) has nonnegative coefficients and the polynomial C_x(X) is constant in D. Then T defined by (25) is a growth transformation of D for R(X).

Proof: Similar to the proofs of the previous two theorems.

Example: To illustrate our algorithm, consider a simple example. Suppose we are interested in obtaining a local maximum of the function R(x, y, z) = x^2/(x^2 + y^2 + z^2), where the variables x, y, z satisfy the constraints x, y, z >= 0, x + y + z = 1. We obtain the maximization algorithm by following our method.

Step 1) Start from some x_0, y_0, z_0 such that x_0 > 0, y_0 > 0, z_0 > 0, x_0 + y_0 + z_0 = 1. Set the iteration index i to 0.

Step 2) Set k = R(x_i, y_i, z_i).

Step 3) (Obtain a polynomial with nonnegative coefficients)

    P(x, y, z) = x^2 - k(x^2 + y^2 + z^2) + k(x + y + z)^2.    (26)

Note that the last term is constant over the domain x, y, z >= 0, x + y + z = 1. This polynomial is already homogeneous, so the update formulas of (6) can be applied.

Step 4) Use the update formulas from (6). Since

    x \partial P/\partial x + y \partial P/\partial y + z \partial P/\partial z = 2x^2 + 4kxy + 4kxz + 4kyz,

the updates are

    x_{i+1} = (x^2 + kxy + kxz) / (x^2 + 2kxy + 2kxz + 2kyz)    (27)

    y_{i+1} = (kxy + kyz) / (x^2 + 2kxy + 2kxz + 2kyz)    (28)

    z_{i+1} = (kxz + kyz) / (x^2 + 2kxy + 2kxz + 2kyz)    (29)

with the right-hand sides evaluated at (x_i, y_i, z_i).

Step 5) Increment the iteration index i by 1 and go to Step 2).

Plots of R(x, y, z) and of x, y, z over 9 iterations of this algorithm, starting from (x, y, z) = (0.3, 0.3, 0.4), are given in Fig. 1, and in Figs. 2(a)-2(d) we give plots of R(x, y, z) over 9 iterations for four different starting points (x, y, z). As can be seen from these graphs, the value of the function improves rapidly.

Fig. 1. R, x, y, and z versus iteration index, starting from (0.3, 0.3, 0.4).
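The example can be run directly; the short script below is our own transcription of Steps 1)-5), using the update formulas (27)-(29), and prints the kind of trajectory summarized in Fig. 1.

```python
# Iterating the updates (27)-(29) for R(x, y, z) = x**2 / (x**2 + y**2 + z**2)
# on the simplex x + y + z = 1, starting from (0.3, 0.3, 0.4) as in Fig. 1.
def R(x, y, z):
    return x**2 / (x**2 + y**2 + z**2)

x, y, z = 0.3, 0.3, 0.4
for i in range(9):
    k = R(x, y, z)                            # Step 2: k <- R(x_i, y_i, z_i)
    s = x**2 + 2*k*x*y + 2*k*x*z + 2*k*y*z    # common denominator of (27)-(29)
    x, y, z = ((x**2 + k*x*y + k*x*z) / s,    # (27)
               (k*x*y + k*y*z) / s,           # (28)
               (k*x*z + k*y*z) / s)           # (29)  (all use the old x, y, z)
    print(i + 1, round(R(x, y, z), 6), round(x, 4), round(y, 4), round(z, 4))
```

The value of R rises from about 0.26 at the starting point to about 0.75 after one iteration and above 0.95 within five, consistent with the rapid improvement visible in the figures.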
V. APPLICATIONS

The motivation for the work reported in this paper came from a study of the parameter estimation problem in speech recognition. Certain criteria used in estimating the parameters of hidden Markov models for words in speech recognition result in the maximization of rational functions over a domain of probability values, and a fast maximization algorithm for such functions is highly desirable. We discuss some details of this application next.

Let W_k, k = 1,...,V, be hidden Markov models for the words in the vocabulary. Assume for notational convenience that the states in all the models have unique numbers i, i = 1, 2,...,S. Typically, the word models are constructed by concatenating models for the phones (basic units of speech), as shown in Fig. 3 and Fig. 4. In the simplest case the outputs on the arcs of these models are acoustic labels. The acoustic signal is sampled and digitized, and feature vectors are extracted from it. These are vector quantized to produce a sequence of labels \vec{y} = (y_1,...,y_n). A hidden Markov model for a word² W_k is a parametric probability function giving the likelihood of a sequence of labels \vec{y}_k = (y_1,...,y_{n_k}) on the hypothesis that W_k was uttered. The parameters θ of the models W_k are the probabilities a = {a_{ij}}, where a_{ij} is the probability of taking the transition to state j from state i in some model (the so-called transition probabilities), and the probabilities b = {b_{ijk}}, where b_{ijk} is the probability of generating the output label k while taking the transition from state i to state j in some model (the so-called output probabilities).

Fig. 3. Hidden Markov model of a phone.

Fig. 4. Hidden Markov model for a word.

²For convenience, we refer to both the word and the associated HMM as W_k.
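For concreteness, the likelihood P_θ(\vec{y} | W_k) that such a model assigns to a label sequence can be computed with the standard forward recursion over the parameters a_{ij} and b_{ijk}. The sketch below is our own illustration; the tiny three-state, two-label model and its left-to-right topology are made up for the example and are not one of the paper's word models.

```python
# Forward computation of P_theta(y | W) for a discrete-output HMM with
# transition probabilities a[i, j] and output probabilities b[i, j, k]
# (probability of emitting label k on the transition i -> j).
import numpy as np

def forward_likelihood(a, b, labels, start_state=0, final_state=None):
    S = a.shape[0]
    final_state = S - 1 if final_state is None else final_state
    alpha = np.zeros(S)
    alpha[start_state] = 1.0                 # start in the initial state
    for k in labels:                         # one recursion step per observed label
        alpha = np.einsum('i,ij,ij->j', alpha, a, b[:, :, k])
    return alpha[final_state]                # probability of ending in the final state

# Tiny 3-state left-to-right model over a 2-symbol label alphabet (illustrative).
a = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
b = np.zeros((3, 3, 2))
b[0, 0] = [0.9, 0.1]; b[0, 1] = [0.8, 0.2]
b[1, 1] = [0.3, 0.7]; b[1, 2] = [0.2, 0.8]
b[2, 2] = [0.5, 0.5]
print(forward_likelihood(a, b, labels=[0, 1, 1]))   # P_theta(y | W) for y = (0, 1, 1)
```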

The typical training problem is to estimate the transition and output probabilities of the models given a sample of acoustics \vec{y} = y_1, y_2,...,y_n arising from the utterance of a sequence of words Y = W_{t_1}, W_{t_2},...,W_{t_m}, t_r in {1,...,V}. Maximum likelihood estimation of θ consists of choosing the parameters so as to maximize the sample likelihood

    P_θ(\vec{y} | Y) = \prod_{r=1}^{m} P_θ(\vec{y}_r | W_{t_r}).    (30)

This can be achieved via the E-M algorithm based on the Baum-Eagon inequality. However, when the goal of parameter estimation is the design of a good classifier, such as in a speech recognition system, it is sometimes helpful to have the estimation procedure maximize the value of some rational function of the sample likelihood P_θ(\vec{y} | Y) (see [2], [7]). For instance, we might desire to maximize the conditional likelihood P_θ(Y | \vec{y}), which for isolated word speech can be written as

    P_θ(Y | \vec{y}) = \prod_{r=1}^{m} [ P_θ(\vec{y}_r | W_{t_r}) P(W_{t_r}) / \sum_{k=1}^{V} P_θ(\vec{y}_r | W_k) P(W_k) ].    (31)

The likelihood functions P_θ(\vec{y}_r | W_{t_r}) are polynomials in the parameters θ = (a_{ij}, b_{ijk}), and hence the conditional likelihood is a rational function of the parameters. Other estimation criteria that give rise to rational objective functions are the maximum mutual information criterion [2] and its generalization, the H-criterion [7]. (For a discussion of the motivation for using such criteria see [2] or [7].) The algorithm described in this paper provides a fast method for estimating the parameters of the models using such criteria.

A. Implementation

We implemented this algorithm for estimating the parameters a_{ij}, b_{ijk} using the H-criterion [7]. The objective function being maximized was the H-criterion H_θ, which depends on a parameter h of the algorithm. In each iteration of the maximization algorithm we used a transformation based on (18): the updated parameters \hat{a}_{ij} and \hat{b}_{ijk} were obtained using

    \hat{a}_{ij} = a_{ij} ( (\partial \log H_θ/\partial a_{ij})(x) + C(x) ) / \sum_{j} a_{ij} ( (\partial \log H_θ/\partial a_{ij})(x) + C(x) )    (33)

    \hat{b}_{ijk} = b_{ijk} ( (\partial \log H_θ/\partial b_{ijk})(x) + D(x) ) / \sum_{k} b_{ijk} ( (\partial \log H_θ/\partial b_{ijk})(x) + D(x) ).    (34)

Here x is the point in the parameter domain defined by the values a_{ij} and b_{ijk} in the current iteration. The use of the partial derivatives of \log H_θ in these formulas can easily be seen to be correct. The update formulas in (18) require the partial derivatives \partial/\partial a_{ij}(N_θ - k D_θ), where N_θ denotes the numerator of H_θ, D_θ its denominator, and k the current value of H_θ. Clearly,

    \partial/\partial a_{ij}(N_θ - k D_θ) = N_θ [ (1/N_θ) \partial N_θ/\partial a_{ij} - (1/D_θ) \partial D_θ/\partial a_{ij} ] = N_θ \, \partial \log H_θ/\partial a_{ij}.    (35)

In the update formulas the common factor N_θ cancels out of the numerator and denominator, changing only the value of the constant C(x), which does not affect the correctness of the formulas. Similarly for the update formulas for b_{ijk}. In a practical implementation, the use of logs makes it easier to keep the values within the dynamic range of the computer.

Since faster convergence requires a small constant C in (18), and determination of such a constant is rather involved, we used an approximate version of this transformation. The values of C(x) and D(x) in each iteration were chosen to make all the derivatives, shifted by these constants, positive:

    C(x) = max { max_{i,j} ( -(\partial \log H_θ/\partial a_{ij})(x) ), 0 } + ε    (36)

    D(x) = max { max_{i,j,k} ( -(\partial \log H_θ/\partial b_{ijk})(x) ), 0 } + ε    (37)

where ε is a small positive constant. We used the forward-backward algorithm (see [1]) to compute the derivatives that go into this transformation.
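A schematic sketch of one such reestimation step for the transition probabilities, with C(x) chosen as in (36), is given below. The array shapes and the helper dlogH_da (standing for the derivatives produced by the forward-backward pass) are illustrative assumptions, not the actual implementation.

```python
# Sketch of one approximate reestimation step for the transition probabilities,
# following (33) with C(x) chosen as in (36). The derivative matrix dlogH_da is
# assumed to be supplied by a forward-backward computation that is not shown here.
import numpy as np

def reestimate_transitions(a, dlogH_da, eps=1e-3):
    """a: S x S matrix of transition probabilities, each row summing to 1.
    dlogH_da: S x S matrix of d(log H_theta)/d(a_ij) at the current parameters.
    Returns the updated transition matrix of (33)."""
    C = max(np.max(-dlogH_da), 0.0) + eps            # (36): make every term positive
    num = a * (dlogH_da + C)                         # a_ij (d log H / d a_ij + C(x))
    return num / num.sum(axis=1, keepdims=True)      # normalize each row, as in (33)
```

The output probabilities b_{ijk} are updated in the same way, using the constant D(x) of (37) and normalizing over the label index k instead of the destination state j.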
Notice that for sufficiently large values of C(x) and D(x) one can prove that the transformation given by (33) and (34) is a growth transformation for the H-criterion. However, for the choice of values in (36) and (37) we cannot prove this. In other words, the algorithm is not guaranteed to increase the value of the objective function in each iteration.

However, the values in (36) and (37) are easy to compute, and experiments show that the objective function increases very rapidly from one iteration to the next, illustrating the usefulness of this modified version of our algorithm in practical problems.

This algorithm was used to estimate the parameters of the hidden Markov models for the words in the twenty-thousand-word speech recognition system at IBM Research. A comparison of the percentage increase in the value of the objective function using this algorithm and a gradient hill climbing algorithm, for different values of the parameter h of the H-criterion, is shown in Table I. These results were obtained after six iterations of each algorithm. As is evident from the table, our algorithm succeeds in increasing the objective function by an order of magnitude or more above what is achieved by the gradient hill climbing algorithm for the same number of iterations. In fact, the gradient algorithm requires fifty or more iterations to obtain a comparable improvement in the objective function. Notice that the amount of computation required in each iteration of our algorithm is the same as that required by the gradient hill climbing algorithm that we implemented. In practice, too much computational power is required to run about fifty iterations of such an algorithm on practical data.

TABLE I
PERCENTAGE IMPROVEMENT IN OBJECTIVE FUNCTION*

    h        Gradient Hill Climbing        Our Algorithm
    -                  -%                      21.7%
    -                  -%                      22.4%
    -                  -%                      16.5%

*Obtained by six iterations of a gradient hill climbing algorithm and of the algorithm described in Section IV.

VI. CONCLUSION

We presented an algorithm for the maximization of certain rational functions defined over domains of probability values. This algorithm is an extension of a method presented first by Baum and Eagon [3] and by others [4]. Our algorithm finds application in several areas, including problems arising in statistics. The motivation for this work arose from studying the estimation of the parameters of a speech recognition system. The discussion presented in Section V shows that our algorithm is very useful in this practical situation, being effective even in an approximate form. It can be expected that the algorithm will prove effective in many other practical applications.

ACKNOWLEDGMENT

The authors thank the referees for several suggestions that improved the quality of the paper.

REFERENCES

[1] L. R. Bahl, F. Jelinek, and R. L. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, no. 2, pp. 179-190, Mar. 1983.
[2] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1986.
[3] L. E. Baum and J. A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology," Bull. Amer. Math. Soc., vol. 73, pp. 360-363, 1967.
[4] L. E. Baum and G. Sell, "Growth transformations for functions on manifolds," Pacific J. Math., vol. 27, no. 2, pp. 211-227, 1968.
[5] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Ann. Math. Statist., vol. 41, no. 1, pp. 164-171, 1970.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc. Ser. B, vol. 39, pp. 1-38, 1977.
[7] P. S. Gopalakrishnan, D. Kanevsky, A. Nadas, D. Nahamoo, and M. A. Picheny, "Decoder selection based on cross-entropies," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1988.
