FINITE STATE LANGUAGE MODELS SMOOTHED USING n-grams

International Journal of Pattern Recognition and Artificial Intelligence, Vol. 16, No. 3 (2002). © World Scientific Publishing Company

DAVID LLORENS and JUAN MIGUEL VILAR
Departament de Llenguatges i Sistemes Informàtics, Universitat Jaume I de Castelló, Spain
dllorens@lsi.uji.es, jvilar@lsi.uji.es

FRANCISCO CASACUBERTA
Dpt. de Sistemes Informàtics i Computació, Institut Tecnològic d'Informàtica, Universitat Politècnica de València, Spain
fcn@iti.upv.es

We address the problem of smoothing the probability distribution defined by a finite state automaton. Our approach extends the ideas employed for smoothing n-gram models. This extension is obtained by interpreting n-gram models as finite state models. The experiments show that our smoothing improves perplexity over smoothed n-grams and Error Correcting Parsing techniques.

Keywords: Language modeling; smoothing; stochastic finite state automata.

Work partially funded by Fundación Bancaja's Neurotrad project (P1A99-10).

1. Introduction

In different tasks like speech recognition, OCR, or translation, it is necessary to estimate the probability of a given sentence. For example, in speech recognition we are given an utterance u (a sequence of vectors representing the sounds emitted by the speaker) and we need to compute the most probable sentence ŵ given u. That is,

    \hat{w} = \arg\max_{w} P(w \mid u).

Using Bayes' rule, and since u is given, we can transform this into

    \hat{w} = \arg\max_{w} P(w)\, P(u \mid w).

The term P(w) is known as the language model and intuitively represents the a priori probability of uttering w. Ideally, these models should represent the language under consideration and they should not allow ungrammatical sentences.

In practice, it is adequate that they define a probability distribution over the free monoid of all possible sentences. This distribution is expected to be non-null at every point. The process of modifying a given distribution to make it non-null is traditionally called smoothing.

A widely used language model is the so-called n-gram. These models estimate the probability of a word considering only the previous n-1 words. A large range of different smoothing techniques exists for n-gram models [2]. However, when the distribution is defined by a probabilistic finite state automaton, the situation is very different: the only non-trivial smoothing in the literature is Error Correcting Parsing [4].

We present an approach to smoothing finite state models that is based on the techniques used for smoothing n-gram models. This is done by first interpreting the n-gram models as finite state automata. This interpretation is extended to the smoothing techniques. The smoothing is then modified in order to adapt it to general finite state models.

The basic idea underlying the smoothing is to parse the input sentence using the original automaton as long as possible, resorting to an n-gram (possibly smoothed) when arriving at a state with no adequate arcs for the current input word. After visiting the n-gram, control returns to the automaton. This is similar to backoff smoothing, where the probability of a word is given by its n-1 predecessors when possible; if that is not possible, only n-2 predecessors are considered, and so on.

An example may help to clarify this. Suppose that the corpus has only these two sentences:

    I saw two new cars
    you drove a new car

Assume also that the learning algorithm yields the model at the top of Fig. 1. The bigram model for the corpus is seen in the middle of that figure, and the unigram at the bottom (the arc labeled Σ stands for the whole vocabulary). We now have to estimate the probability of the new sentence "I drove a new car". The parsing starts in the initial state of our model. We follow the arc to state B. Now, there are no outgoing arcs with label "drove", so we have to resort to the bigram. We go down to the state corresponding to "I" (the history so far). As there is also no such arc, we go down again to the unigram. Now, using the probability of the unigram, we can go up to the state in the bigram corresponding to "drove". When the word "a" is seen, we can return to the original automaton. We do this by observing that the history so far is "drove a" and that state I has that history. The words "new" and "car" can be treated by the automaton.
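To make the control flow of this example concrete, here is a loose Python sketch of the three-level fallback (automaton, then bigram, then unigram). The state names, tables and probabilities are made up for illustration, the return to the automaton is simplified to a one-word history, and the discounting and normalization of Sec. 4 are ignored; it is not the exact construction of the paper.

    # Toy stand-ins for the three models of Fig. 1 (illustrative values only).
    automaton_arcs = {
        ("s0", "I"): ("s1", 0.5), ("s1", "saw"): ("s2", 1.0),
        ("s0", "you"): ("s5", 0.5), ("s5", "drove"): ("s6", 1.0),
        ("s6", "a"): ("s7", 1.0), ("s7", "new"): ("s8", 1.0), ("s8", "car"): ("s9", 1.0),
    }
    bigram = {("you", "drove"): 0.9, ("drove", "a"): 0.9, ("a", "new"): 0.9}
    unigram = {w: 0.1 for w in "I saw two new cars you drove a new car".split()}
    # Automaton states indexed by a one-word history, used to climb back up (cf. Sec. 4).
    state_with_history = {"I": "s1", "you": "s5", "drove": "s6", "a": "s7", "new": "s8"}

    def score(words):
        """Rough illustration of the automaton/bigram/unigram fallback."""
        state, prev, p = "s0", "#", 1.0
        for w in words:
            if (state, w) in automaton_arcs:            # use the automaton while possible
                state, arc_p = automaton_arcs[(state, w)]
                p *= arc_p
            elif (prev, w) in bigram:                   # otherwise fall back to the bigram
                p *= bigram[(prev, w)]
                state = state_with_history.get(w, state)
            else:                                       # last resort: the unigram
                p *= unigram.get(w, 1e-6)
                state = state_with_history.get(w, state)
            prev = w
        return p

    print(score("I drove a new car".split()))   # falls to the unigram only for "drove"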

Fig. 1. Parsing of "I drove a new car" by an automaton smoothed using a bigram, which in turn is smoothed by a unigram. From top to bottom: the automaton, the bigram and the unigram. Dotted lines represent arcs of the models not used in the parsing.

2. Basic Concepts and Notation

In this section, we introduce some basic concepts in order to fix the notation. An alphabet is a finite set of symbols (words); we represent alphabets by calligraphic letters like X. Strings (or sentences) are concatenations of symbols and are represented using a letter with a bar, as in x̄. The individual words are designated by the name of the sentence and a subindex indicating the position, so x̄ = x_1 x_2 ... x_n. The length of a sentence is denoted by |x̄|. Segments of a sentence are denoted by x̄_i^j = x_i ... x_j. For suffixes of the form x̄_i^{|x̄|} we use the shorter notation x̄_i. The set of all strings over X is represented by X*. The empty string is represented by λ.

A probabilistic finite state automaton η is a six-tuple (X, Q, q_0, δ, π, ψ) where:

- X is an alphabet.
- Q is a finite set of states.
- q_0 ∈ Q is the initial state.
- δ is the set of transitions.
- π is the function that assigns probabilities to transitions.
- ψ is the function that assigns to each state the probability of being final.

The set δ is a subset of Q × (X ∪ {λ}) × Q. It represents the possible movements of the automaton. The elements of δ are assigned probabilities by π, a complete function from δ into the real numbers (usually the values will be between 0 and 1, but we will relax that somewhat). Abusing notation, we will write δ(q, x) for the set {q′ ∈ Q | (q, x, q′) ∈ δ}. If |δ(q, x)| = 1, we will write π(q, x) for π(q, x, δ(q, x)).

An automaton can be used for defining a probability distribution on X*. First define a path from q to q′ with input x̄ to be a sequence (q_1, x_1, q_2)(q_2, x_2, q_3) ... (q_n, x_n, q_{n+1}) such that q_1 = q, q_{n+1} = q′ and x̄ = x_1 ... x_n (note that n may be larger than |x̄| since some of the x_i may be λ). Define the probability of the path to be the product of the probabilities of its transitions. Now, for a string x̄, define its probability P_η(x̄) to be the sum, over all paths from q_0 to some state q with input x̄, of the probability of the path times the final probability of q. Symbolically,

    P_\eta(\bar{x}) = \sum_{c \in C(q_0,\bar{x})} \Big( \prod_{i=1}^{|c|} \pi(q_i, x_i, q_{i+1}) \Big)\, \psi(q_{|c|+1}),

where C(q_0, x̄) is the set of paths starting in q_0 and having x̄ as input. The well-known forward algorithm can be used for computing this quantity. If there are no λ-arcs in the automaton, we can use the version of Fig. 2(c). The absence of λ-arcs is not an important restriction, since in this paper we will smooth automata without such arcs. Furthermore, an equivalent automaton without λ-arcs can always be obtained.

During the rest of the paper, we will assume that we are working with a training corpus for building the models. We use the function N to represent the number of times that a certain string appears in that corpus.
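As a concrete illustration of the forward computation just described, here is a small Python sketch for λ-free automata; the transition table, probabilities and final-state function are hypothetical toy values, not taken from the paper.

    from collections import defaultdict

    # Hypothetical toy automaton: delta[(q, x)] lists (next_state, prob) pairs,
    # final[q] is the probability of q being final (psi in the paper's notation).
    delta = {("q0", "a"): [("q1", 0.6), ("q2", 0.4)], ("q1", "b"): [("q2", 1.0)],
             ("q2", "b"): [("q2", 0.5)]}
    final = {"q0": 0.0, "q1": 0.0, "q2": 0.5}

    def forward_probability(sentence, q0="q0"):
        """Forward algorithm for a lambda-free probabilistic automaton (Fig. 2(c))."""
        prefix_prob = {q0: 1.0}                        # mass of reaching each live state
        for word in sentence:
            next_prob = defaultdict(float)
            for q, p in prefix_prob.items():
                for q_next, arc_p in delta.get((q, word), []):
                    next_prob[q_next] += p * arc_p     # sum over all paths, as in P_eta(x)
            prefix_prob = next_prob
        return sum(p * final.get(q, 0.0) for q, p in prefix_prob.items())

    print(forward_probability(["a", "b", "b"]))   # 0.6*1.0*0.5*0.5 + 0.4*0.5*0.5*0.5 = 0.2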

3. The n-gram Model as a Finite State Model

A language model computes the probability of a sentence x̄. A usual decomposition of this probability is

    P(\bar{x}) = P(x_1)\, P(x_2 \mid x_1) \cdots P(x_n \mid \bar{x}_1^{\,n-1}).

We can say then that the language model computes the probability that a word x follows a string of words q̄, the history of that word. Since it is impossible to accurately estimate P(x | q̄) for every possible q̄, some restriction has to be placed. An n-gram model considers only the last n-1 words of q̄. But, since even that may be too long, whenever the probability of x given the last n-1 words cannot be estimated directly from the training corpus, the backoff-smoothed (BS) n-gram model uses the last n-2 words, and so on, until arriving at a unigram model [5]. More formally,

    P_{BO}(x \mid \bar{q}) =
      \begin{cases}
        \tilde{P}(x \mid \bar{q})               & \text{if } N(\bar{q}x) > 0, \\
        C_{\bar{q}}\, P_{BO}(x \mid \bar{q}_2)  & \text{if } N(\bar{q}x) = 0,
      \end{cases}

where P̃ is the discounted probability of P obtained with the chosen discount strategy and C_q̄ is a normalization constant chosen so that ∑_{x ∈ X} P_BO(x | q̄) = 1. The CMU-Cambridge Toolkit [3] is a well-known tool for building BS n-gram models.

3.1. The n-gram model as a deterministic automaton

We can represent a backoff model by a deterministic automaton η = (X, Q_η, q̄_0, δ_η, π_η, ψ_η), as follows. The set of states will be Q_η = { q̄ | q̄ ∈ (X ∪ {#})^{n-1} } ∪ {λ}. That is, we have a state for each possible history of n-1 words plus an initial state (#^{n-1}). The special symbol # is not part of the vocabulary and is used to simplify the notation. The states represent the history used so far. A string w_1 ... w_{n-1} represents the fact that the last n-1 words were w_1, ..., w_{n-1}. On the other hand, a string # ... # w represents the moment in which the first word has been read and the next has to be predicted. Finally, the end of the sentence corresponds to the automaton predicting # for a given history. Note that in this manner all the histories have the same length.

The transitions will be δ_η(q̄, x) = (q̄x)_2. The function π will reflect the probabilities of the backoff model, π_η(q̄, x) = P_BO(x | q̄). Finally, ψ_η(q̄) = P_BO(# | q̄) assigns the final-state probability. It is trivial to see that this deterministic automaton is proper, so it is consistent using traditional deterministic parsing [Fig. 2(a)]. Moreover, the probability distribution it induces is the same as for the original model.

3.2. The n-gram model as a nondeterministic automaton

The representation of the previous section can be too large. An equivalent, but much smaller, model can be obtained by using a nondeterministic automaton. For this, we define the automaton η = (X, Q_η, q̄_0, δ_η, π_η, ψ_η) as follows. The set of states will be Q_η = { q̄ | q̄ ∈ (X ∪ {#})^{<n}, N(q̄) > 0 } ∪ {λ}. So we have a state for each k-gram seen in the training (1 ≤ k < n) plus a state for the unigram. These states represent the history used so far. Now, the transitions will be

    \delta_\eta(\bar{q}, x) =
      \begin{cases}
        (\bar{q}x)_2 & \text{if } |\bar{q}| = n-1 \text{ and } x \in \mathcal{X}, \\
        \bar{q}x     & \text{if } |\bar{q}| < n-1 \text{ and } x \in \mathcal{X}, \\
        \bar{q}_2    & \text{if } \bar{q} \neq \lambda \text{ and } x = \lambda.
      \end{cases}

The nonempty transitions represent the way the history is updated with the new words; the empty transitions from the k-grams to the corresponding (k-1)-grams represent the backing-off. The function π will reflect the probabilities of the backoff model:

    \pi_\eta(\bar{q}, x) =
      \begin{cases}
        \tilde{P}(x \mid \bar{q}) & \text{if } x \in \mathcal{X}, \\
        C_{\bar{q}}               & \text{if } x = \lambda.
      \end{cases}
      \qquad (1)

Finally, ψ_η(q̄) = P̃(# | q̄) assigns the discounted final-state probability.
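The following sketch shows one way the transition and probability functions of this nondeterministic representation could be laid out in code. The count and probability tables (counts, p_disc, backoff_C) are hypothetical stand-ins for N, P̃ and C_q̄; this only illustrates Eq. (1) and the transition cases, it is not the authors' implementation.

    # Hypothetical inputs: n-gram order, counts of seen k-grams (k < n), discounted
    # probabilities p_disc[(history, word)] and backoff weights backoff_C[history].
    n = 3
    counts = {("the",): 10, ("the", "cat"): 4, ("cat",): 5}
    p_disc = {(("the", "cat"), "sat"): 0.6, (("cat",), "sat"): 0.3, ((), "sat"): 0.01}
    backoff_C = {("the", "cat"): 0.4, ("cat",): 0.2, (): 0.0}

    # States: every k-gram seen in training (1 <= k < n) plus the unigram state ().
    states = {h for h in counts if 0 < len(h) < n} | {()}

    def delta(history, symbol):
        """Transition function of Sec. 3.2; symbol=None plays the role of a lambda-arc."""
        if symbol is None:                      # backoff: drop the oldest word
            return history[1:] if history else None
        if len(history) == n - 1:               # full history: slide the window
            return (history + (symbol,))[1:]
        return history + (symbol,)              # short history: just extend it

    def pi(history, symbol):
        """Arc probabilities as in Eq. (1)."""
        if symbol is None:
            return backoff_C.get(history, 0.0)  # C for the lambda-arc
        return p_disc.get((history, symbol), 0.0)

    print(sorted(states))
    print(delta(("the", "cat"), "sat"), pi(("the", "cat"), "sat"))   # ('cat', 'sat') 0.6
    print(delta(("the", "cat"), None), pi(("the", "cat"), None))     # ('cat',) 0.4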

As this automaton is not proper, the language it generates is not consistent using standard parsing. Instead, we should use what we call deterministic-backoff parsing. The idea is that, given a state q̄ and a word x, if δ(q̄, x) exists it is used; if not, the λ-arc is used and the process is repeated in q̄_2. This algorithm can be seen in Fig. 2(b). It can be easily proven that this automaton, used with deterministic-backoff parsing, is equivalent to the traditional backoff n-gram model.

3.3. The GIS n-gram model

If we do not insist on having exactly the same probabilities as the backoff model, a new smoothing method can be derived from (1). This is done by choosing C_q̄ such that the model is consistent; the appropriate value is simply C_q̄ = 1 - ∑_{x ∈ X} π_η(q̄, x). This ensures that the model is consistent and no special parsing is needed (i.e. forward parsing works correctly). We call this new model GIS n-gram (General Interpolation Smoothing). Note that, for n = 2, this model is the same as the model obtained by interpolating the bigram with the unigram (the traditional interpolated model [2]); however, for n > 2 the two models are different.

4. Automata Smoothing Using n-grams (SUN)

Suppose that we have both a stochastic (possibly nondeterministic) automaton τ and a smoothed n-gram model η. It is tempting to use η for smoothing τ. We present here a way to do this. The idea is to create arcs between τ and η. These new arcs come in two groups: the down arcs and the up arcs. The down arcs are λ-arcs from τ into η; they are used when the current word has no arc from the current state. The probability of these arcs is discounted from the original arcs of τ. The up arcs are used to return to τ and they distribute the original probabilities of η.

In order to present the construction, we need the concept of the set of histories of length k for a state q. This is

    H_{\tau,k}(q) = \{ \#^{\,l}\bar{h} \mid \bar{h} \in \mathcal{X}^{k-l},\ q_0 \xrightarrow{\bar{h}} q \}
                    \cup \{ \bar{h} \in \mathcal{X}^{k} \mid \exists p \in Q_\tau : p \xrightarrow{\bar{h}} q \}.

Intuitively, this represents those strings of length k that are suffixes of a path leading to q. In case there is a path from the initial state shorter than k, the corresponding string is padded at the beginning with symbols #. It is also useful to define the pseudo-inverse function H^{-1}_{τ,k}(v̄), which is the set of states in Q_τ that have the length-k suffix of the string v̄ among their histories. These functions are easily extended to sets:

    H_{\tau,k}(Q') = \bigcup_{q \in Q'} H_{\tau,k}(q), \quad Q' \in \mathcal{P}(Q),
    \qquad
    H^{-1}_{\tau,k}(H) = \bigcup_{\bar{v} \in H} H^{-1}_{\tau,k}(\bar{v}), \quad H \in \mathcal{P}(\mathcal{X}^{k}).
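As an illustration of how such history sets could be collected in practice, the sketch below walks the arcs of a toy λ-free automaton and accumulates, for every state, the #-padded length-k suffixes of the paths reaching it. The toy automaton and the fixed-point iteration are assumptions of this sketch, not code from the paper.

    # Hypothetical toy automaton: list of (source, word, target) arcs, initial state "A".
    arcs = [("A", "I", "B"), ("B", "saw", "C"), ("A", "you", "F"),
            ("F", "drove", "G"), ("G", "a", "H")]
    initial, k = "A", 2

    def history_sets(arcs, initial, k):
        """Fixed-point computation of the length-k history sets H_{tau,k}(q)."""
        states = {initial} | {q for _, _, q in arcs} | {p for p, _, _ in arcs}
        H = {q: set() for q in states}
        H[initial].add(("#",) * k)                 # the empty path, padded with '#'
        changed = True
        while changed:                             # propagate histories along the arcs
            changed = False
            for p, w, q in arcs:
                for h in list(H[p]):
                    new_h = (h + (w,))[1:]         # keep only the last k symbols
                    if new_h not in H[q]:
                        H[q].add(new_h)
                        changed = True
        return H

    H = history_sets(arcs, initial, k)
    print(H["G"])    # {('you', 'drove')}: the only length-2 history reaching G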

Algorithm Deterministic parsing
    Input: string x̄, automaton
    Output: probability of x̄
    q := q_0;  p := 1;
    for i := 1 to |x̄| do
        p := p · π(q, x_i);
        q := δ(q, x_i);
    end for
    p := p · ψ(q);
    return p;
End
(a) Deterministic parsing

Algorithm Deterministic-backoff parsing
    Input: string x̄, automaton
    Output: probability of x̄
    q := q_0;  p := 1;
    for i := 1 to |x̄| do
        while δ(q, x_i) = ∅ do
            p := p · π(q, λ);
            q := δ(q, λ);
        end while
        p := p · π(q, x_i);
        q := δ(q, x_i);
    end for
    p := p · ψ(q);
    return p;
End
(b) Deterministic-backoff parsing

Algorithm Forward parsing
    Input: string x̄, automaton
    Output: probability of x̄
    P_1[q_0] := 1;  V_1 := {q_0};
    for i := 1 to |x̄| do
        V_2 := ∅;  P_2 := [0];
        for q ∈ V_1 do
            p := P_1[q];
            for q′ ∈ δ(q, x_i) do
                V_2 := V_2 ∪ {q′};
                P_2[q′] := P_2[q′] + p · π(q, x_i, q′);
            end for
        end for
        V_1 := V_2;  P_1 := P_2;
    end for
    return Σ_{q ∈ V_1} P_1[q] · ψ(q);
End
(c) Forward parsing

Algorithm Forward-backoff parsing
    Input: string x̄, automaton
    Output: probability of x̄
    P_1[q_0] := 1;  V_1 := {q_0};
    for i := 1 to |x̄| do
        V_2 := ∅;  P_2 := [0];
        for q ∈ V_1 do
            p := P_1[q];
            while δ(q, x_i) = ∅ do
                p := p · π(q, λ);
                q := δ(q, λ);
            end while
            for q′ ∈ δ(q, x_i) do
                V_2 := V_2 ∪ {q′};
                P_2[q′] := P_2[q′] + p · π(q, x_i, q′);
            end for
        end for
        V_1 := V_2;  P_1 := P_2;
    end for
    return Σ_{q ∈ V_1} P_1[q] · ψ(q);
End
(d) Forward-backoff parsing

Fig. 2. The parsing algorithms. The backoff versions (b) and (d) differ from the conventional versions (a) and (c) only in the while loops that follow the λ-arcs; the rest of the algorithms is unchanged.
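For readers who prefer executable code, here is a small Python transcription of the forward-backoff parser of Fig. 2(d); the encoding of the automaton (dictionaries keyed by state and symbol, with None standing for λ) is an assumption of this sketch rather than the paper's data structures. The while loop is the only difference from the conventional forward parser of Fig. 2(c).

    from collections import defaultdict

    def forward_backoff(sentence, delta, pi, psi, q0):
        """Forward parsing that uses a lambda-arc only when the active state has no
        arc labeled with the next input symbol (Fig. 2(d)).

        Encoding assumed by this sketch:
          delta[(q, w)]    -> list of successor states for word w (absent if no arc),
          delta[(q, None)] -> the single backoff (lambda) successor of q,
          pi[(q, w, q2)]   -> probability of a word arc, pi[(q, None)] of the lambda-arc,
          psi[q]           -> final-state probability.
        The smoothed models of the paper guarantee that backing off eventually reaches
        a state with an arc for every vocabulary word, so the while loop terminates.
        """
        V1 = {q0: 1.0}
        for word in sentence:
            V2 = defaultdict(float)
            for q, p in V1.items():
                while (q, word) not in delta:        # back off while there is no word arc
                    p *= pi[(q, None)]
                    q = delta[(q, None)]
                for q2 in delta[(q, word)]:
                    V2[q2] += p * pi[(q, word, q2)]
            V1 = dict(V2)
        return sum(p * psi.get(q, 0.0) for q, p in V1.items())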

The rest of this section is divided into three parts: first, we present the formal models; after that, a particular set of probabilities for doing the smoothing is explained; finally, we introduce the construction of automata with an interesting property: the set of histories of each state is a singleton. This makes them somewhat analogous to the n-grams and facilitates smoothing.

4.1. Definition of the models

The set of states of the smoothed model is simply Q_τ ∪ Q_η, the union of the states of both models (without loss of generality, we assume that they are disjoint). There are four types of arcs:

- Down arcs (δ_d).
- Up arcs (δ_u).
- Stay arcs (δ′_η).
- The original arcs of τ (δ_τ).

As commented above, down arcs go from τ into η. Suppose that the analysis is in state q and the following word has no arc departing from q. In this case, it is sensible to resort to η. We can only tell that the path leading to q has followed one of the histories of length n-1, so we will have a λ-arc corresponding to each of these histories. This arc will go to the only state in η having that history (since η is an n-gram there can be at most one state for each history of length n-1 and, assuming the same training data for τ and η, it is safe to expect that such a state actually exists). Formally,

    \delta_d = \{ (q, \lambda, \bar{h}) \mid q \in Q_\tau,\ \bar{h} \in H_{\tau,n-1}(q) \}.

Notice that we are identifying the states of Q_η (the states of the smoothed n-gram model η) with the strings leading to them. This can be done as in the construction presented in Sec. 3.2.

Once the analysis has visited η, it should go back to τ. This is done by means of the up arcs. The idea is better explained from a particular state in Q_η and a symbol x. Since η is an n-gram, there is only one history h̄ arriving at that state. When we join it with the symbol under consideration, we get a longer history. The up arc is drawn from that state into those states of τ having h̄x in their set of histories. This seeks to ensure that the analysis returns to a sensible point in τ. In a more formal way,

    \delta_u = \{ (\bar{h}, x, q) \in Q_\eta \times \mathcal{X} \times Q_\tau \mid q \in H^{-1}_{\tau,n}(\bar{h}x) \}.

Note that there may be symbols for which the extended history does not belong to any state in τ. This gives rise to stay arcs, which are the arcs of η that do not generate up arcs:

    \delta'_\eta = \{ (\bar{h}, x, \bar{h}') \in \delta_\eta \mid \delta_u(\bar{h}, x) = \emptyset \}.

Finally, all the original arcs of τ are also part of the model (δ_τ).

In parallel with the definition of the arcs, we find the definition of four sets of probabilities. The down arcs get their probabilities after discounting some probability mass from the arcs of τ. We use a certain function d : δ_τ → [0, 1] for this discounting. The remaining probability of each state is distributed among the corresponding down arcs. This is done by a function b : δ_d → [0, 1], which must fulfill

    \sum_{\bar{h} \in \delta_d(q, \lambda)} b(q, \lambda, \bar{h}) = 1.    (2)

With these two functions, we can define π̃_τ, the probability of the arcs of τ in the new model, as

    \tilde{\pi}_\tau(q, x, q') = d(q, x, q')\, \pi(q, x, q'),  \quad (q, x, q') \in \delta_\tau.

The definition of π_d, the probability of the down arcs, is

    \pi_d(q, \lambda, \bar{h}) = C_q\, b(q, \lambda, \bar{h}),  \quad (q, \lambda, \bar{h}) \in \delta_d.

The normalization constant C_q is computed in a manner analogous to that of n-gram backoff smoothing.

The up arcs get their probabilities by distributing the probability of the arc that originated them. For this, we use a function s : δ_u → [0, 1] that must fulfill

    \sum_{q \in \delta_u(\bar{h}, x)} s(\bar{h}, x, q) = \pi_\eta(\bar{h}, x).    (3)

With this function, the probabilities of the up arcs are trivial; simply define

    \pi_u(\bar{h}, x, q) = s(\bar{h}, x, q),  \quad (\bar{h}, x, q) \in \delta_u.

Finally, the stay arcs keep their original probabilities, so

    \pi'_\eta(\bar{h}, x, \bar{h}') = \pi_\eta(\bar{h}, x, \bar{h}'),  \quad (\bar{h}, x, \bar{h}') \in \delta'_\eta.

We define SUN(τ, η, d, b, s) as the automaton (X, Q, q_0, δ, π, ψ) where:

(a) the states Q are the union of the states of τ and η: Q = Q_τ ∪ Q_η;
(b) the transitions are the union of the four sets of arcs: δ = δ_τ ∪ δ_d ∪ δ_u ∪ δ′_η;
(c) the probabilities of the arcs are the union of the four sets of probabilities: π = π̃_τ ∪ π_d ∪ π_u ∪ π′_η;
(d) the function ψ, which assigns the final probability to every state, is

    \psi(q) =
      \begin{cases}
        \tilde{\psi}_\tau(q) & \text{if } q \in Q_\tau, \\
        \psi_\eta(q)         & \text{if } q \in Q_\eta,
      \end{cases}

where ψ̃_τ(q) is the discounted final probability.

This automaton, which we call SUN BS, is neither deterministic nor proper. To make the defined language consistent, a special parsing is required. We present the forward-backoff parsing, a modified forward parsing that does not use the λ-arcs of an active state if there exists an arc labeled with the next input symbol [Fig. 2(d)].

As in n-gram backoff smoothing, we can obtain a SUN GIS automaton by choosing C_q so that the automaton is proper. The language it models is consistent using forward parsing. The methods SUN BS and SUN GIS constitute a generalization of the n-gram smoothing techniques BS and GIS, respectively. This can be seen in the fact that if an n-gram is smoothed with a smoothed (n-1)-gram using SUN BS (respectively, SUN GIS), the result is the same as that of a BS (respectively, GIS) model.
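To make the four arc families concrete, the following sketch assembles the SUN arc set from a toy automaton and bigram. The helper names (histories, inv_histories, ngram_arcs) and all data are hypothetical placeholders for H_{τ,n-1}, H^{-1}_{τ,n} and δ_η, and the probability functions d, b and s are omitted.

    # Hypothetical inputs for smoothing a toy automaton with a bigram (n = 2).
    n = 2
    tau_arcs = {("B", "saw"): "C", ("G", "a"): "H"}                       # arcs of tau
    ngram_arcs = {(("I",), "saw"): ("saw",), (("drove",), "a"): ("a",)}   # arcs of eta
    histories = {"B": {("I",)}, "G": {("drove",)}, "H": {("a",)}}         # H_{tau,n-1}
    inv_histories = {("drove", "a"): {"H"}}                               # H^{-1}_{tau,n}

    def build_sun_arcs():
        """Construct the down, up and stay arcs of SUN(tau, eta, ...)."""
        down = {(q, None, h) for q, hs in histories.items() for h in hs}  # lambda-arcs into eta
        up, stay = set(), set()
        for (h, x), h2 in ngram_arcs.items():
            targets = inv_histories.get(h + (x,), set())
            if targets:                                   # extended history exists in tau: go up
                up |= {(h, x, q) for q in targets}
            else:                                         # otherwise stay inside the n-gram
                stay.add((h, x, h2))
        original = {(q, x, q2) for (q, x), q2 in tau_arcs.items()}
        return original | down | up | stay

    for arc in sorted(build_sun_arcs(), key=str):
        print(arc)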

4.2. Choosing the distributions

The formalization of SUN allows different parameterizations of the functions d, b and s. In the n-gram smoothing literature there are several discounting techniques that can be easily adapted to play the role of d. We have chosen Witten-Bell discounting [6], which is powerful and simple to implement. Given an automaton τ,

    \tilde{\pi}_\tau(q, x, q') = d(q, x, q')\, \pi_\tau(q, x, q')
                               = \frac{N(q)}{N(q) + D(q)} \cdot \frac{N(q, x, q')}{N(q)}
                               = \frac{N(q, x, q')}{N(q) + D(q)},

where N(q, x, q') and N(q) are the number of times this arc/state was used when analyzing the training corpus, and D(q) is the number of different labeled arcs leaving state q (plus one if it is final). The smoothed final-state probability is ψ̃_τ(q) = N_f(q) / (N(q) + D(q)), where N_f(q) is the number of times q was final when analyzing the training corpus.

We feel that the functions b and s should take into account the frequency of the different histories in each state. This information can be easily obtained using a history counting function over a training corpus. As a way of distributing the probabilities in accordance with these counts, we define b(q, λ, h̄) and s(h̄, x, q) as

    b(q, \lambda, \bar{h}) = \frac{E_{\tau,q}(\bar{h})}{\sum_{\bar{h}' \in H_{\tau,n-1}(q)} E_{\tau,q}(\bar{h}')},
    \qquad
    s(\bar{h}, x, q) = \pi_\eta(\bar{h}, x)\, \frac{E_{\tau,q}(\bar{h}x)}{\sum_{q' \in H^{-1}_{\tau,n}(\bar{h}x)} E_{\tau,q'}(\bar{h}x)},

where E_{τ,q}(h̄) is the number of times we arrive at state q of τ with history h̄. It can be easily proven that both b and s satisfy their respective restrictions, (2) and (3).
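A small sketch of how these quantities could be computed from usage counts; the count dictionaries are hypothetical stand-ins for N(q), N(q, x, q'), N_f(q), D(q) and E_{τ,q}(h̄), which in practice would be gathered while parsing the training corpus with τ.

    # Hypothetical usage counts collected while parsing the training corpus with tau.
    state_count = {"B": 20}                                     # N(q)
    arc_count = {("B", "saw", "C"): 12, ("B", "two", "D"): 6}   # N(q, x, q')
    final_count = {"B": 2}                                      # N_f(q)
    history_count = {("B", ("I",)): 15, ("B", ("you",)): 5}     # E_{tau,q}(h)

    def witten_bell_arc_prob(q, x, q2):
        """Discounted arc probability N(q,x,q')/(N(q)+D(q)) (Witten-Bell)."""
        D = len({a for a in arc_count if a[0] == q}) + (1 if final_count.get(q) else 0)
        return arc_count.get((q, x, q2), 0) / (state_count[q] + D)

    def down_arc_weight(q, h):
        """b(q, lambda, h): relative frequency of history h among the histories seen at q."""
        total = sum(c for (s, _), c in history_count.items() if s == q)
        return history_count.get((q, h), 0) / total

    print(witten_bell_arc_prob("B", "saw", "C"))   # 12 / (20 + 3)
    print(down_arc_weight("B", ("I",)))            # 15 / 20 = 0.75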

4.3. Single history automata

A problem for SUN is the existence of more than one history of length n-1 in the states of the automaton. Remember that the n-gram predicts the next word by considering the previous n-1 words. Now, consider a state q and suppose it has three histories h̄_1, h̄_2 and h̄_3 of length n-1. This implies the construction of three down arcs. But, when actually using the automaton, at most one of those histories is the one actually seen, so the other two down arcs lead to states of the n-gram whose histories do not correspond to the path followed. It would be desirable to have only one history per state, so that it is always possible to go down to the state with the correct history.

We can in fact achieve that for general automata. First, define τ to be an n-SH automaton if for every state q the set H_{τ,n}(q) is a singleton. The interesting result is that every automaton can be converted into an equivalent n-SH automaton by a simple construction. Given τ, the corresponding n-SH automaton τ′ is defined by:

    Q_{\tau'} = \{ (q, \bar{h}) \mid \bar{h} \in H_{\tau,n}(q) \},
    \qquad
    q_{0,\tau'} = (q_0, \#^{n}),
    \qquad
    \delta_{\tau'} = \{ ((q, \bar{h}), x, (q', (\bar{h}x)_2)) \mid (q, x, q') \in \delta_\tau,\ \bar{h} \in H_{\tau,n}(q) \}.

With respect to the probabilities, this conversion can be done in two manners. The first one is to translate the probabilities so that the new automaton represents the same distribution. The other one, which we have followed, takes into account that the original distribution was inferred from a training corpus and is therefore only an approximation; we feel that it is better to reestimate the probabilities on the new automaton. As seen in the experiments below, the best approach for smoothing the automaton is to first convert it into an n-SH automaton, then reestimate the probabilities, and after that apply the SUN.

5. Experiments

To test the SUN we use ATIS-2, a subcorpus of ATIS, which is a set of sentences recorded from speakers requesting information from an air travel database. ATIS-2 contains a training set of 13,044 utterances (130,773 tokens), and two evaluation sets of 974 utterances (10,636 tokens) and 1,001 utterances (11,703 tokens), respectively. The vocabulary contains 1,294 words. We use the second evaluation set as test set. The first evaluation set was not used; however, the Error Correcting Parsing (ECP) approach we compare with uses this set to estimate the error model.

Three aspects were considered in the experiments: SUN versus other automata smoothing techniques; SUN GIS versus SUN BS; and automata smoothing versus n-gram smoothing. The experiments consisted in building the smoothed unigram, bigram and trigram for the task (using both GIS and BS smoothing), plus an automaton obtained using the ALERGIA algorithm [1]. This automaton was smoothed using direct smoothing with a unigram, Error Correcting Parsing, and the SUN approach. The results of the first two methods constitute the reference and they were presented by Dupont and Amengual [4]. The SUN approach was tested with the different n-grams, using both the original automaton and the corresponding n-SH automata.

Table 1 shows the results of the experiments. The reference results are those presented by Dupont and Amengual [4]. Their experiments were done with the same corpus and the same ALERGIA automaton, so our results are comparable with theirs. They smoothed the automaton using a unigram and different ECP models (ECP smoothing does not use n-grams). They obtained a perplexity of 71 with the unigram smoothing and of 37 with the best ECP model, nearly a 50% improvement but still far from the perplexity of bigrams or trigrams.

Table 1. Test set perplexity results. Column SUN SH (SUN using single history automata) shows two results: using all paths and using only the best one. (a) GIS; (b) BS; (c) Reference (direct unigram smoothing and ECP).

Table 2. Number of arcs of the automaton and of the n-gram traversed during the parsing of the test set using only the best path. The last column shows the number of down arcs employed. (a) GIS; (b) BS.

Our results improve on theirs. The use of SUN GIS with unigrams slightly improves perplexity. When using bigrams or more, the results are far better (nearly a 30% improvement over ECP for trigrams), but still worse than n-grams. When the automaton is transformed into its equivalent n-SH, the results are even better than for bigrams; for trigrams they are equivalent. When using BS, the conclusions are very similar, except for the case of the automaton not transformed into an n-SH.

In the last column of Table 1 we can see the effect of using only the best path instead of summing over all possible paths. This effect is negligible for SUN BS but important for SUN GIS. The difference is due to the large number of paths accepting a sentence in SUN GIS.

Another important question is whether the parsing uses more arcs from the automaton or from the n-gram. It would be desirable that most of the arcs correspond to the automaton. In Table 2 we can see the numbers for the best paths. Both for SUN BS and SUN GIS, most of the symbols are parsed in the automaton. The ratio decreases as the order of the n-gram increases. This shows that the better the smoothing model, the more it is used. Using SUN BS and unigrams, the percentage of symbols parsed by the automaton is 91%; for bigrams it is 82% and for trigrams 67%. On the other hand, the percentage of words using down arcs is 5% for unigrams, 10% for bigrams and 13% for trigrams. The behavior of SUN GIS is similar.

For comparison purposes, we have repeated the tests of the BS n-gram models using the CMU-Cambridge Toolkit [3] with the same discounting function. As expected, the results are the same as those in the second column of Table 1(b).
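The evaluation metric throughout the experiments is test-set perplexity. The computation is not spelled out in the paper, but the standard definition can be sketched as follows; sentence_probability stands in for any of the parsers sketched above.

    import math

    def perplexity(test_sentences, sentence_probability):
        """Standard test-set perplexity: 2 ** (cross-entropy per word).
        Assumes the smoothed model assigns a nonzero probability to every sentence."""
        log_prob, n_words = 0.0, 0
        for sentence in test_sentences:
            log_prob += math.log2(sentence_probability(sentence))
            n_words += len(sentence) + 1     # +1 if the end-of-sentence event is counted
        return 2.0 ** (-log_prob / n_words)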

6. Conclusions

Smoothed n-grams can be formalized as finite state models. This formalization has been used in two directions: for defining a new smoothing for n-grams, GIS, and for extending backoff smoothing and GIS to general automata.

The extension of n-gram smoothing to general automata has been made using several new concepts also introduced here. In the first place, the definition of the sets of histories of a state is used in order to define the relationship between the states of the automaton and the n-gram used for smoothing. Related to that, we present a transformation of automata (n-SH) that makes them conceptually nearer to the structure of the n-grams, thereby easing the smoothing. Finally, appropriate modifications of the parsing algorithms are presented in order to cope with the new automata.

The experimental results indicate that the methods obtain a suitable smoothing. This translates into better perplexities than those of the models used for smoothing, while keeping the structure of the automaton, which allows for modeling more complex relationships than is possible with n-grams.

References

1. R. C. Carrasco and J. Oncina, Learning deterministic regular grammars from stochastic samples in polynomial time, Theoret. Inform. Appl. 33, 1 (1999).
2. S. F. Chen and J. Goodman, An empirical study of smoothing techniques for language modeling, Comput. Speech Lang. 13, 4 (1999).
3. P. Clarkson and R. Rosenfeld, Statistical language modeling using the CMU-Cambridge Toolkit, EUROSPEECH, 1997.
4. P. Dupont and J.-C. Amengual, Smoothing probabilistic automata: an error-correcting approach, Grammatical Inference: Algorithms and Applications, ed. A. de Oliveira, Lecture Notes in Artificial Intelligence, Vol. 1891, Springer-Verlag, 2000.
5. S. M. Katz, Estimation of probabilities from sparse data for the language model component of a speech recognizer, IEEE Trans. ASSP 34, 3 (1987).
6. P. Placeway, R. Schwartz, P. Fung and L. Nguyen, The estimation of powerful language models from small and large corpora, ICASSP 93, Vol. II, 1993.

David Llorens received the Master and Ph.D. degrees in computer science from the Polytechnic University of València, Spain, in 1995 and 2000, respectively. From 1995 to 1997 he worked with the Department of Information Systems and Computation, Polytechnic University of València, first as a predoctoral fellow and from 1996 as Assistant Professor. Since 1997, he has been with the Department of Software and Computing Systems of the University Jaume I of Castelló, Spain, first as Assistant Professor and from 2001 as Associate Professor. He has been an active member of the Pattern Recognition and Artificial Intelligence Group in the Department of Information Systems and Computation, and now he works with the Group for Computational Learning, Automatic Recognition and Translation of Speech at the Department of Software and Computing Systems of the University Jaume I of Castelló, Spain. His current research lies in the areas of finite state language modeling, language model smoothing, speech recognition, machine translation and syntactic pattern recognition.

Juan Miguel Vilar received the B.Sc. in computer studies in 1990 from the Liverpool Polytechnic (now John Moores University) in England, and the M.Sc. and Ph.D. degrees in computer science in 1993 and 1998 from the Polytechnic University of Valencia in Spain. He is currently working as an Associate Professor with the Department of Software and Computing Systems of the University Jaume I of Castelló, Spain. He has been an active member of the Pattern Recognition and Artificial Intelligence Group in the Department of Information Systems and Computation and now works with the Group for Computational Learning, Automatic Recognition and Translation of Speech at the Department of Software and Computing Systems of the University Jaume I of Castelló, Spain. His current research is in finite state and statistical models for machine translation, computational learning, language modeling, and word clustering.

Francisco Casacuberta received the Master and Ph.D. degrees in physics from the University of València, Spain, in 1976 and 1981, respectively. From 1976 to 1979, he worked with the Department of Electricity and Electronics at the University of València as an FPI fellow. From 1980 to 1986, he was with the Computing Center of the University of València. Since 1980, he has been with the Department of Information Systems and Computation of the Polytechnic University of València, first as an Associate Professor and from 1990 as a Full Professor. Since 1981, he has been an active member of a research group in the fields of automatic speech recognition and machine translation. Dr. Casacuberta is a member of the Spanish Society for Pattern Recognition and Image Analysis (AERFAI), which is an affiliate society of IAPR, the IEEE Computer Society and the Spanish Association for Artificial Intelligence (AEPIA). His current research interest lies in the areas of speech recognition, machine translation, syntactic pattern recognition, statistical pattern recognition and machine learning.


More information

DESCRIPTIONAL COMPLEXITY OF NFA OF DIFFERENT AMBIGUITY

DESCRIPTIONAL COMPLEXITY OF NFA OF DIFFERENT AMBIGUITY International Journal of Foundations of Computer Science Vol. 16, No. 5 (2005) 975 984 c World Scientific Publishing Company DESCRIPTIONAL COMPLEXITY OF NFA OF DIFFERENT AMBIGUITY HING LEUNG Department

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Language Models, Graphical Models Sameer Maskey Week 13, April 13, 2010 Some slides provided by Stanley Chen and from Bishop Book Resources 1 Announcements Final Project Due,

More information

Theory of Computation

Theory of Computation Thomas Zeugmann Hokkaido University Laboratory for Algorithmics http://www-alg.ist.hokudai.ac.jp/ thomas/toc/ Lecture 3: Finite State Automata Motivation In the previous lecture we learned how to formalize

More information

Advanced Automata Theory 7 Automatic Functions

Advanced Automata Theory 7 Automatic Functions Advanced Automata Theory 7 Automatic Functions Frank Stephan Department of Computer Science Department of Mathematics National University of Singapore fstephan@comp.nus.edu.sg Advanced Automata Theory

More information

Week 13: Language Modeling II Smoothing in Language Modeling. Irina Sergienya

Week 13: Language Modeling II Smoothing in Language Modeling. Irina Sergienya Week 13: Language Modeling II Smoothing in Language Modeling Irina Sergienya 07.07.2015 Couple of words first... There are much more smoothing techniques, [e.g. Katz back-off, Jelinek-Mercer,...] and techniques

More information

Automata Theory for Presburger Arithmetic Logic

Automata Theory for Presburger Arithmetic Logic Automata Theory for Presburger Arithmetic Logic References from Introduction to Automata Theory, Languages & Computation and Constraints in Computational Logic Theory & Application Presented by Masood

More information

Expectation Maximization (EM)

Expectation Maximization (EM) Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted

More information

CS 154, Lecture 2: Finite Automata, Closure Properties Nondeterminism,

CS 154, Lecture 2: Finite Automata, Closure Properties Nondeterminism, CS 54, Lecture 2: Finite Automata, Closure Properties Nondeterminism, Why so Many Models? Streaming Algorithms 0 42 Deterministic Finite Automata Anatomy of Deterministic Finite Automata transition: for

More information

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers

More information

On Homogeneous Segments

On Homogeneous Segments On Homogeneous Segments Robert Batůšek, Ivan Kopeček, and Antonín Kučera Faculty of Informatics, Masaryk University Botanicka 68a, 602 00 Brno Czech Republic {xbatusek,kopecek,tony}@fi.muni.cz Abstract.

More information

Advanced Automata Theory 10 Transducers and Rational Relations

Advanced Automata Theory 10 Transducers and Rational Relations Advanced Automata Theory 10 Transducers and Rational Relations Frank Stephan Department of Computer Science Department of Mathematics National University of Singapore fstephan@comp.nus.edu.sg Advanced

More information

ACS Introduction to NLP Lecture 3: Language Modelling and Smoothing

ACS Introduction to NLP Lecture 3: Language Modelling and Smoothing ACS Introduction to NLP Lecture 3: Language Modelling and Smoothing Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk Language Modelling 2 A language model is a probability

More information

Finite-State Machines (Automata) lecture 12

Finite-State Machines (Automata) lecture 12 Finite-State Machines (Automata) lecture 12 cl a simple form of computation used widely one way to find patterns 1 A current D B A B C D B C D A C next 2 Application Fields Industry real-time control,

More information

Closure under the Regular Operations

Closure under the Regular Operations September 7, 2013 Application of NFA Now we use the NFA to show that collection of regular languages is closed under regular operations union, concatenation, and star Earlier we have shown this closure

More information

Using Multiplicity Automata to Identify Transducer Relations from Membership and Equivalence Queries

Using Multiplicity Automata to Identify Transducer Relations from Membership and Equivalence Queries Using Multiplicity Automata to Identify Transducer Relations from Membership and Equivalence Queries Jose Oncina Dept. Lenguajes y Sistemas Informáticos - Universidad de Alicante oncina@dlsi.ua.es September

More information

Chapter 3: Basics of Language Modeling

Chapter 3: Basics of Language Modeling Chapter 3: Basics of Language Modeling Section 3.1. Language Modeling in Automatic Speech Recognition (ASR) All graphs in this section are from the book by Schukat-Talamazzini unless indicated otherwise

More information

Finite-State Transducers

Finite-State Transducers Finite-State Transducers - Seminar on Natural Language Processing - Michael Pradel July 6, 2007 Finite-state transducers play an important role in natural language processing. They provide a model for

More information

Lecture 2: Connecting the Three Models

Lecture 2: Connecting the Three Models IAS/PCMI Summer Session 2000 Clay Mathematics Undergraduate Program Advanced Course on Computational Complexity Lecture 2: Connecting the Three Models David Mix Barrington and Alexis Maciel July 18, 2000

More information