FINITE STATE LANGUAGE MODELS SMOOTHED USING n-grams

International Journal of Pattern Recognition and Artificial Intelligence, Vol. 16, No. 3 (2002). © World Scientific Publishing Company

DAVID LLORENS and JUAN MIGUEL VILAR
Departament de Llenguatges i Sistemes Informàtics, Universitat Jaume I de Castelló, Spain
dllorens@lsi.uji.es, jvilar@lsi.uji.es

FRANCISCO CASACUBERTA
Dpt. de Sistemes Informàtics i Computació, Institut Tecnològic d'Informàtica, Universitat Politècnica de València, Spain
fcn@iti.upv.es

We address the problem of smoothing the probability distribution defined by a finite state automaton. Our approach extends the ideas employed for smoothing n-gram models. This extension is obtained by interpreting n-gram models as finite state models. The experiments show that our smoothing improves perplexity over smoothed n-grams and Error Correcting Parsing techniques.

Keywords: Language modeling; smoothing; stochastic finite state automata.

Work partially funded by Fundación Bancaja's Neurotrad project (P1A99-10).

1. Introduction

In different tasks like speech recognition, OCR, or translation, it is necessary to estimate the probability of a given sentence. For example, in speech recognition we are given an utterance u (a sequence of vectors representing the sounds emitted by the speaker) and we need to compute the most probable sentence ŵ given u. That is,

    \hat{w} = \arg\max_{w} P(w \mid u).

Using Bayes' rule, and since u is given, we can transform this into

    \hat{w} = \arg\max_{w} P(w)\, P(u \mid w).

The term P(w) is known as the language model and intuitively represents the a priori probability of uttering w. Ideally, these models should represent the language under consideration and they should not allow ungrammatical sentences.

In practice, it is adequate that they define a probability distribution over the free monoid of all possible sentences. This distribution is expected to be non-null at every point. The process of modifying a given distribution to make it non-null is traditionally called smoothing.

A widely used language model is the so-called n-gram. These models estimate the probability of a word considering only the previous n-1 words. A large range of different smoothing techniques exists for n-gram models [2]. However, when the distribution is defined by a probabilistic finite state automaton, the situation is very different: the only non-trivial smoothing in the literature is Error Correcting Parsing [4].

We present an approach to smoothing finite state models that is based on the techniques used for smoothing n-gram models. This is done by first interpreting the n-gram models as finite state automata. This interpretation is extended to the smoothing techniques. The smoothing is then modified in order to adapt it to general finite state models.

The basic idea underlying the smoothing is to parse the input sentence using the original automaton as long as possible, resorting to an n-gram (possibly smoothed) when arriving at a state with no adequate arcs for the current input word. After visiting the n-gram, control returns to the automaton. This is similar to backoff smoothing, where the probability of a word is given by its n-1 predecessors when possible; if that is not possible, only n-2 predecessors are considered, and so on.

An example may help to clarify this. Suppose that the corpus has only these two sentences:

    I saw two new cars
    you drove a new car

Assume also that the learning algorithm yields the model at the top of Fig. 1. The bigram model for the corpus is seen in the middle of that figure, and the unigram at the bottom (the arc labeled Σ stands for the whole vocabulary). We now have to estimate the probability of the new sentence "I drove a new car". The parsing starts in the initial state of our model. We follow the arc to state B. Now, there are no outgoing arcs with label "drove", so we have to resort to the bigram. We go down to the state corresponding to "I" (the history so far). As there is also no such arc, we go down again to the unigram. Now, using the probability of the unigram, we can go up to the state in the bigram corresponding to "drove". When the word "a" is seen, we can return to the original automaton. We do this by observing that the history so far is "drove a" and that state I has that history. The words "new" and "car" can be treated by the automaton.
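To make the control flow of this example concrete, here is a loose Python sketch of the three-level fallback (automaton, then bigram, then unigram). The state names, tables and probabilities are made up for illustration, the return to the automaton is simplified to a one-word history, and the discounting and normalization of Sec. 4 are ignored; it is not the exact construction of the paper.

    # Toy stand-ins for the three models of Fig. 1 (illustrative values only).
    automaton_arcs = {
        ("s0", "I"): ("s1", 0.5), ("s1", "saw"): ("s2", 1.0),
        ("s0", "you"): ("s5", 0.5), ("s5", "drove"): ("s6", 1.0),
        ("s6", "a"): ("s7", 1.0), ("s7", "new"): ("s8", 1.0), ("s8", "car"): ("s9", 1.0),
    }
    bigram = {("you", "drove"): 0.9, ("drove", "a"): 0.9, ("a", "new"): 0.9}
    unigram = {w: 0.1 for w in "I saw two new cars you drove a new car".split()}
    # Automaton states indexed by a one-word history, used to climb back up (cf. Sec. 4).
    state_with_history = {"I": "s1", "you": "s5", "drove": "s6", "a": "s7", "new": "s8"}

    def score(words):
        """Rough illustration of the automaton/bigram/unigram fallback."""
        state, prev, p = "s0", "#", 1.0
        for w in words:
            if (state, w) in automaton_arcs:            # use the automaton while possible
                state, arc_p = automaton_arcs[(state, w)]
                p *= arc_p
            elif (prev, w) in bigram:                   # otherwise fall back to the bigram
                p *= bigram[(prev, w)]
                state = state_with_history.get(w, state)
            else:                                       # last resort: the unigram
                p *= unigram.get(w, 1e-6)
                state = state_with_history.get(w, state)
            prev = w
        return p

    print(score("I drove a new car".split()))   # falls to the unigram only for "drove"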

Fig. 1. Parsing of "I drove a new car" by an automaton smoothed using a bigram, which in turn is smoothed by a unigram. From top to bottom: the automaton, the bigram and the unigram. Dotted lines represent arcs of the models not used in the parsing.

2. Basic Concepts and Notation

In this section, we introduce some basic concepts in order to fix the notation. An alphabet is a finite set of symbols (words); we represent alphabets by calligraphic letters like X. Strings (or sentences) are concatenations of symbols and are represented using a letter with a bar, as in x̄. The individual words are designated by the name of the sentence and a subindex indicating the position, so x̄ = x_1 x_2 ... x_n. The length of a sentence is denoted by |x̄|. Segments of a sentence are denoted by x̄_i^j = x_i ... x_j. For suffixes of the form x̄_i^{|x̄|} we use the shorter notation x̄_i. The set of all strings over X is represented by X*. The empty string is represented by λ.

A probabilistic finite state automaton η is a six-tuple (X, Q, q_0, δ, π, ψ) where:

- X is an alphabet.
- Q is a finite set of states.
- q_0 ∈ Q is the initial state.
- δ is the set of transitions.
- π is the function that assigns probabilities to transitions.
- ψ is the function that assigns to each state the probability of being final.

The set δ is a subset of Q × (X ∪ {λ}) × Q. It represents the possible movements of the automaton. The elements of δ are assigned probabilities by π, a complete function from δ into the real numbers (usually the values will be between 0 and 1, but we will relax that somewhat). Abusing notation, we will write δ(q, x) for the set {q′ ∈ Q | (q, x, q′) ∈ δ}. If |δ(q, x)| = 1, we will write π(q, x) for π(q, x, δ(q, x)).

An automaton can be used for defining a probability distribution on X*. First define a path from q to q′ with input x̄ to be a sequence (q_1, x_1, q_2)(q_2, x_2, q_3) ... (q_n, x_n, q_{n+1}) such that q_1 = q, q_{n+1} = q′ and x̄ = x_1 ... x_n (note that n may be larger than |x̄| since some of the x_i may be λ). Define the probability of the path to be the product of the probabilities of its transitions. Now, for a string x̄, define its probability P_η(x̄) to be the sum, over all paths from q_0 to some state q with input x̄, of the probability of the path times the final probability of q. Symbolically,

    P_\eta(\bar{x}) = \sum_{c \in C(q_0,\bar{x})} \Big( \prod_{i=1}^{|c|} \pi(q_i, x_i, q_{i+1}) \Big)\, \psi(q_{|c|+1}),

where C(q_0, x̄) is the set of paths starting in q_0 and having x̄ as input. The well-known forward algorithm can be used for computing this quantity. If there are no λ-arcs in the automaton, we can use the version of Fig. 2(c). The absence of λ-arcs is not an important restriction, since in this paper we will smooth automata without such arcs. Furthermore, an equivalent automaton without λ-arcs can always be obtained.

During the rest of the paper, we will assume that we are working with a training corpus for building the models. We use the function N to represent the number of times that a certain string appears in that corpus.
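As a concrete illustration of the forward computation just described, here is a small Python sketch for λ-free automata; the transition table, probabilities and final-state function are hypothetical toy values, not taken from the paper.

    from collections import defaultdict

    # Hypothetical toy automaton: delta[(q, x)] lists (next_state, prob) pairs,
    # final[q] is the probability of q being final (psi in the paper's notation).
    delta = {("q0", "a"): [("q1", 0.6), ("q2", 0.4)], ("q1", "b"): [("q2", 1.0)],
             ("q2", "b"): [("q2", 0.5)]}
    final = {"q0": 0.0, "q1": 0.0, "q2": 0.5}

    def forward_probability(sentence, q0="q0"):
        """Forward algorithm for a lambda-free probabilistic automaton (Fig. 2(c))."""
        prefix_prob = {q0: 1.0}                        # mass of reaching each live state
        for word in sentence:
            next_prob = defaultdict(float)
            for q, p in prefix_prob.items():
                for q_next, arc_p in delta.get((q, word), []):
                    next_prob[q_next] += p * arc_p     # sum over all paths, as in P_eta(x)
            prefix_prob = next_prob
        return sum(p * final.get(q, 0.0) for q, p in prefix_prob.items())

    print(forward_probability(["a", "b", "b"]))   # 0.6*1.0*0.5*0.5 + 0.4*0.5*0.5*0.5 = 0.2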

3. The n-gram Model as a Finite State Model

A language model computes the probability of a sentence x̄. A usual decomposition of this probability is

    P(\bar{x}) = P(x_1)\, P(x_2 \mid x_1) \cdots P(x_n \mid \bar{x}_1^{\,n-1}).

We can say then that the language model computes the probability that a word x follows a string of words q̄, the history of that word. Since it is impossible to accurately estimate P(x | q̄) for every possible q̄, some restriction has to be placed. An n-gram model considers only the last n-1 words of q̄. But, since even that may be too long, whenever the probability of x given the last n-1 words cannot be estimated directly from the training corpus, the backoff-smoothed (BS) n-gram model uses the last n-2 words, and so on, until arriving at a unigram model [5]. More formally,

    P_{BO}(x \mid \bar{q}) =
      \begin{cases}
        \tilde{P}(x \mid \bar{q})               & \text{if } N(\bar{q}x) > 0, \\
        C_{\bar{q}}\, P_{BO}(x \mid \bar{q}_2)  & \text{if } N(\bar{q}x) = 0,
      \end{cases}

where P̃ is the discounted probability of P obtained with the chosen discount strategy and C_q̄ is a normalization constant chosen so that ∑_{x ∈ X} P_BO(x | q̄) = 1. The CMU-Cambridge Toolkit [3] is a well-known tool for building BS n-gram models.

3.1. The n-gram model as a deterministic automaton

We can represent a backoff model by a deterministic automaton η = (X, Q_η, q̄_0, δ_η, π_η, ψ_η), as follows. The set of states will be Q_η = { q̄ | q̄ ∈ (X ∪ {#})^{n-1} } ∪ {λ}. That is, we have a state for each possible history of n-1 words plus an initial state (#^{n-1}). The special symbol # is not part of the vocabulary and is used to simplify the notation. The states represent the history used so far. A string w_1 ... w_{n-1} represents the fact that the last n-1 words were w_1, ..., w_{n-1}. On the other hand, a string # ... # w represents the moment in which the first word has been read and the next has to be predicted. Finally, the end of the sentence corresponds to the automaton predicting # for a given history. Note that in this manner all the histories have the same length.

The transitions will be δ_η(q̄, x) = (q̄x)_2. The function π will reflect the probabilities of the backoff model, π_η(q̄, x) = P_BO(x | q̄). Finally, ψ_η(q̄) = P_BO(# | q̄) assigns the final-state probability. It is trivial to see that this deterministic automaton is proper, so it is consistent using traditional deterministic parsing [Fig. 2(a)]. Moreover, the probability distribution it induces is the same as for the original model.

3.2. The n-gram model as a nondeterministic automaton

The representation of the previous section can be too large. An equivalent, but much smaller, model can be obtained by using a nondeterministic automaton. For this, we define the automaton η = (X, Q_η, q̄_0, δ_η, π_η, ψ_η) as follows. The set of states will be Q_η = { q̄ | q̄ ∈ (X ∪ {#})^{<n}, N(q̄) > 0 } ∪ {λ}. So we have a state for each k-gram seen in the training (1 ≤ k < n) plus a state for the unigram. These states represent the history used so far. Now, the transitions will be

    \delta_\eta(\bar{q}, x) =
      \begin{cases}
        (\bar{q}x)_2 & \text{if } |\bar{q}| = n-1 \text{ and } x \in \mathcal{X}, \\
        \bar{q}x     & \text{if } |\bar{q}| < n-1 \text{ and } x \in \mathcal{X}, \\
        \bar{q}_2    & \text{if } \bar{q} \neq \lambda \text{ and } x = \lambda.
      \end{cases}

The nonempty transitions represent the way the history is updated with the new words; the empty transitions from the k-grams to the corresponding (k-1)-grams represent the backing-off. The function π will reflect the probabilities of the backoff model:

    \pi_\eta(\bar{q}, x) =
      \begin{cases}
        \tilde{P}(x \mid \bar{q}) & \text{if } x \in \mathcal{X}, \\
        C_{\bar{q}}               & \text{if } x = \lambda.
      \end{cases}
      \qquad (1)

Finally, ψ_η(q̄) = P̃(# | q̄) assigns the discounted final-state probability.
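The following sketch shows one way the transition and probability functions of this nondeterministic representation could be laid out in code. The count and probability tables (counts, p_disc, backoff_C) are hypothetical stand-ins for N, P̃ and C_q̄; this only illustrates Eq. (1) and the transition cases, it is not the authors' implementation.

    # Hypothetical inputs: n-gram order, counts of seen k-grams (k < n), discounted
    # probabilities p_disc[(history, word)] and backoff weights backoff_C[history].
    n = 3
    counts = {("the",): 10, ("the", "cat"): 4, ("cat",): 5}
    p_disc = {(("the", "cat"), "sat"): 0.6, (("cat",), "sat"): 0.3, ((), "sat"): 0.01}
    backoff_C = {("the", "cat"): 0.4, ("cat",): 0.2, (): 0.0}

    # States: every k-gram seen in training (1 <= k < n) plus the unigram state ().
    states = {h for h in counts if 0 < len(h) < n} | {()}

    def delta(history, symbol):
        """Transition function of Sec. 3.2; symbol=None plays the role of a lambda-arc."""
        if symbol is None:                      # backoff: drop the oldest word
            return history[1:] if history else None
        if len(history) == n - 1:               # full history: slide the window
            return (history + (symbol,))[1:]
        return history + (symbol,)              # short history: just extend it

    def pi(history, symbol):
        """Arc probabilities as in Eq. (1)."""
        if symbol is None:
            return backoff_C.get(history, 0.0)  # C for the lambda-arc
        return p_disc.get((history, symbol), 0.0)

    print(sorted(states))
    print(delta(("the", "cat"), "sat"), pi(("the", "cat"), "sat"))   # ('cat', 'sat') 0.6
    print(delta(("the", "cat"), None), pi(("the", "cat"), None))     # ('cat',) 0.4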

As this automaton is not proper, the language it generates is not consistent using standard parsing. Instead, we should use what we call deterministic-backoff parsing. The idea is that, given a state q̄ and a word x, if δ(q̄, x) exists it is used; if not, the λ-arc is used and the process is repeated in q̄_2. This algorithm can be seen in Fig. 2(b). It can be easily proven that this automaton, used with deterministic-backoff parsing, is equivalent to the traditional backoff n-gram model.

3.3. The GIS n-gram model

If we do not insist on having exactly the same probabilities as the backoff model, a new smoothing method can be derived from (1). This is done by choosing C_q̄ such that the model is consistent; the appropriate value is simply C_q̄ = 1 - ∑_{x ∈ X} π_η(q̄, x). This ensures that the model is consistent and no special parsing is needed (i.e. forward parsing works correctly). We call this new model GIS n-gram (General Interpolation Smoothing). Note that, for n = 2, this model is the same as the model obtained by interpolating the bigram with the unigram (the traditional interpolated model [2]); however, for n > 2 the two models are different.

4. Automata Smoothing Using n-grams (SUN)

Suppose that we have both a stochastic (possibly nondeterministic) automaton τ and a smoothed n-gram model η. It is tempting to use η for smoothing τ. We present here a way to do this. The idea is to create arcs between τ and η. These new arcs come in two groups: the down arcs and the up arcs. The down arcs are λ-arcs from τ into η; they are used when the current word has no arc from the current state. The probability of these arcs is discounted from the original arcs of τ. The up arcs are used to return to τ and they distribute the original probabilities of η.

In order to present the construction, we need the concept of the set of histories of length k for a state q. This is

    H_{\tau,k}(q) = \{ \#^{\,l}\bar{h} \mid \bar{h} \in \mathcal{X}^{k-l},\ q_0 \xrightarrow{\bar{h}} q \}
                    \cup \{ \bar{h} \in \mathcal{X}^{k} \mid \exists p \in Q_\tau : p \xrightarrow{\bar{h}} q \}.

Intuitively, this represents those strings of length k that are suffixes of a path leading to q. In case there is a path from the initial state shorter than k, the corresponding string is padded at the beginning with symbols #. It is also useful to define the pseudo-inverse function H^{-1}_{τ,k}(v̄), which is the set of states in Q_τ that have the length-k suffix of the string v̄ among their histories. These functions are easily extended to sets:

    H_{\tau,k}(Q') = \bigcup_{q \in Q'} H_{\tau,k}(q), \quad Q' \in \mathcal{P}(Q),
    \qquad
    H^{-1}_{\tau,k}(H) = \bigcup_{\bar{v} \in H} H^{-1}_{\tau,k}(\bar{v}), \quad H \in \mathcal{P}(\mathcal{X}^{k}).
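As an illustration of how such history sets could be collected in practice, the sketch below walks the arcs of a toy λ-free automaton and accumulates, for every state, the #-padded length-k suffixes of the paths reaching it. The toy automaton and the fixed-point iteration are assumptions of this sketch, not code from the paper.

    # Hypothetical toy automaton: list of (source, word, target) arcs, initial state "A".
    arcs = [("A", "I", "B"), ("B", "saw", "C"), ("A", "you", "F"),
            ("F", "drove", "G"), ("G", "a", "H")]
    initial, k = "A", 2

    def history_sets(arcs, initial, k):
        """Fixed-point computation of the length-k history sets H_{tau,k}(q)."""
        states = {initial} | {q for _, _, q in arcs} | {p for p, _, _ in arcs}
        H = {q: set() for q in states}
        H[initial].add(("#",) * k)                 # the empty path, padded with '#'
        changed = True
        while changed:                             # propagate histories along the arcs
            changed = False
            for p, w, q in arcs:
                for h in list(H[p]):
                    new_h = (h + (w,))[1:]         # keep only the last k symbols
                    if new_h not in H[q]:
                        H[q].add(new_h)
                        changed = True
        return H

    H = history_sets(arcs, initial, k)
    print(H["G"])    # {('you', 'drove')}: the only length-2 history reaching G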

Algorithm Deterministic parsing
    Input: string x̄, automaton
    Output: probability of x̄
    q := q_0;  p := 1;
    for i := 1 to |x̄| do
        p := p · π(q, x_i);
        q := δ(q, x_i);
    end for
    p := p · ψ(q);
    return p;
End
(a) Deterministic parsing

Algorithm Deterministic-backoff parsing
    Input: string x̄, automaton
    Output: probability of x̄
    q := q_0;  p := 1;
    for i := 1 to |x̄| do
        while δ(q, x_i) = ∅ do
            p := p · π(q, λ);
            q := δ(q, λ);
        end while
        p := p · π(q, x_i);
        q := δ(q, x_i);
    end for
    p := p · ψ(q);
    return p;
End
(b) Deterministic-backoff parsing

Algorithm Forward parsing
    Input: string x̄, automaton
    Output: probability of x̄
    P_1[q_0] := 1;  V_1 := {q_0};
    for i := 1 to |x̄| do
        V_2 := ∅;  P_2 := [0];
        for q ∈ V_1 do
            p := P_1[q];
            for q′ ∈ δ(q, x_i) do
                V_2 := V_2 ∪ {q′};
                P_2[q′] := P_2[q′] + p · π(q, x_i, q′);
            end for
        end for
        V_1 := V_2;  P_1 := P_2;
    end for
    return Σ_{q ∈ V_1} P_1[q] · ψ(q);
End
(c) Forward parsing

Algorithm Forward-backoff parsing
    Input: string x̄, automaton
    Output: probability of x̄
    P_1[q_0] := 1;  V_1 := {q_0};
    for i := 1 to |x̄| do
        V_2 := ∅;  P_2 := [0];
        for q ∈ V_1 do
            p := P_1[q];
            while δ(q, x_i) = ∅ do
                p := p · π(q, λ);
                q := δ(q, λ);
            end while
            for q′ ∈ δ(q, x_i) do
                V_2 := V_2 ∪ {q′};
                P_2[q′] := P_2[q′] + p · π(q, x_i, q′);
            end for
        end for
        V_1 := V_2;  P_1 := P_2;
    end for
    return Σ_{q ∈ V_1} P_1[q] · ψ(q);
End
(d) Forward-backoff parsing

Fig. 2. The parsing algorithms. The backoff versions (b) and (d) differ from the conventional versions (a) and (c) only in the while loops that follow the λ-arcs; the rest of the algorithms is unchanged.
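For readers who prefer executable code, here is a small Python transcription of the forward-backoff parser of Fig. 2(d); the encoding of the automaton (dictionaries keyed by state and symbol, with None standing for λ) is an assumption of this sketch rather than the paper's data structures. The while loop is the only difference from the conventional forward parser of Fig. 2(c).

    from collections import defaultdict

    def forward_backoff(sentence, delta, pi, psi, q0):
        """Forward parsing that uses a lambda-arc only when the active state has no
        arc labeled with the next input symbol (Fig. 2(d)).

        Encoding assumed by this sketch:
          delta[(q, w)]    -> list of successor states for word w (absent if no arc),
          delta[(q, None)] -> the single backoff (lambda) successor of q,
          pi[(q, w, q2)]   -> probability of a word arc, pi[(q, None)] of the lambda-arc,
          psi[q]           -> final-state probability.
        The smoothed models of the paper guarantee that backing off eventually reaches
        a state with an arc for every vocabulary word, so the while loop terminates.
        """
        V1 = {q0: 1.0}
        for word in sentence:
            V2 = defaultdict(float)
            for q, p in V1.items():
                while (q, word) not in delta:        # back off while there is no word arc
                    p *= pi[(q, None)]
                    q = delta[(q, None)]
                for q2 in delta[(q, word)]:
                    V2[q2] += p * pi[(q, word, q2)]
            V1 = dict(V2)
        return sum(p * psi.get(q, 0.0) for q, p in V1.items())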

The rest of this section is divided into three parts: first, we present the formal models; after that, a particular set of probabilities for doing the smoothing is explained; finally, we introduce the construction of automata with an interesting property: the set of histories of each state is a singleton. This makes them somewhat analogous to the n-grams and facilitates smoothing.

4.1. Definition of the models

The set of states of the smoothed model is simply Q_τ ∪ Q_η, the union of the states of both models (without loss of generality, we assume that they are disjoint). There are four types of arcs:

- Down arcs (δ_d).
- Up arcs (δ_u).
- Stay arcs (δ′_η).
- The original arcs of τ (δ_τ).

As commented above, down arcs go from τ into η. Suppose that the analysis is in state q and the following word has no arc departing from q. In this case, it is sensible to resort to η. We can only tell that the path leading to q has followed one of the histories of length n-1, so we will have a λ-arc corresponding to each of these histories. This arc will go to the only state in η having that history (since η is an n-gram there can be at most one state for each history of length n-1 and, assuming the same training data for τ and η, it is safe to expect that such a state actually exists). Formally,

    \delta_d = \{ (q, \lambda, \bar{h}) \mid q \in Q_\tau,\ \bar{h} \in H_{\tau,n-1}(q) \}.

Notice that we are identifying the states of Q_η (the states of the smoothed n-gram model η) with the strings leading to them. This can be done as in the construction presented in Sec. 3.2.

Once the analysis has visited η, it should go back to τ. This is done by means of the up arcs. The idea is better explained from a particular state in Q_η and a symbol x. Since η is an n-gram, there is only one history h̄ arriving at that state. When we join it with the symbol under consideration, we get a longer history. The up arc is drawn from that state into those states of τ having h̄x in their set of histories. This seeks to ensure that the analysis returns to a sensible point in τ. In a more formal way,

    \delta_u = \{ (\bar{h}, x, q) \in Q_\eta \times \mathcal{X} \times Q_\tau \mid q \in H^{-1}_{\tau,n}(\bar{h}x) \}.

Note that there may be symbols for which the extended history does not belong to any state in τ. This gives rise to stay arcs, which are the arcs of η that do not generate up arcs:

    \delta'_\eta = \{ (\bar{h}, x, \bar{h}') \in \delta_\eta \mid \delta_u(\bar{h}, x) = \emptyset \}.

Finally, all the original arcs of τ are also part of the model (δ_τ).

In parallel with the definition of the arcs, we find the definition of four sets of probabilities. The down arcs get their probabilities after discounting some probability mass from the arcs of τ. We use a certain function d : δ_τ → [0, 1] for this discounting. The remaining probability of each state is distributed among the corresponding down arcs. This is done by a function b : δ_d → [0, 1], which must fulfill

    \sum_{\bar{h} \in \delta_d(q, \lambda)} b(q, \lambda, \bar{h}) = 1.    (2)

With these two functions, we can define π̃_τ, the probability of the arcs of τ in the new model, as

    \tilde{\pi}_\tau(q, x, q') = d(q, x, q')\, \pi(q, x, q'),  \quad (q, x, q') \in \delta_\tau.

The definition of π_d, the probability of the down arcs, is

    \pi_d(q, \lambda, \bar{h}) = C_q\, b(q, \lambda, \bar{h}),  \quad (q, \lambda, \bar{h}) \in \delta_d.

The normalization constant C_q is computed in a manner analogous to that of n-gram backoff smoothing.

The up arcs get their probabilities by distributing the probability of the arc that originated them. For this, we use a function s : δ_u → [0, 1] that must fulfill

    \sum_{q \in \delta_u(\bar{h}, x)} s(\bar{h}, x, q) = \pi_\eta(\bar{h}, x).    (3)

With this function, the probabilities of the up arcs are trivial; simply define

    \pi_u(\bar{h}, x, q) = s(\bar{h}, x, q),  \quad (\bar{h}, x, q) \in \delta_u.

Finally, the stay arcs keep their original probabilities, so

    \pi'_\eta(\bar{h}, x, \bar{h}') = \pi_\eta(\bar{h}, x, \bar{h}'),  \quad (\bar{h}, x, \bar{h}') \in \delta'_\eta.

We define SUN(τ, η, d, b, s) as the automaton (X, Q, q_0, δ, π, ψ) where:

(a) the states Q are the union of the states of τ and η: Q = Q_τ ∪ Q_η;
(b) the transitions are the union of the four sets of arcs: δ = δ_τ ∪ δ_d ∪ δ_u ∪ δ′_η;
(c) the probabilities of the arcs are the union of the four sets of probabilities: π = π̃_τ ∪ π_d ∪ π_u ∪ π′_η;
(d) the function ψ, which assigns the final probability to every state, is

    \psi(q) =
      \begin{cases}
        \tilde{\psi}_\tau(q) & \text{if } q \in Q_\tau, \\
        \psi_\eta(q)         & \text{if } q \in Q_\eta,
      \end{cases}

where ψ̃_τ(q) is the discounted final probability.

This automaton, which we call SUN BS, is neither deterministic nor proper. To make the defined language consistent, a special parsing is required. We present the forward-backoff parsing, a modified forward parsing that does not use the λ-arcs of an active state if there exists an arc labeled with the next input symbol [Fig. 2(d)].

As in n-gram backoff smoothing, we can obtain a SUN GIS automaton by choosing C_q so that the automaton is proper. The language it models is consistent using forward parsing. The methods SUN BS and SUN GIS constitute a generalization of the n-gram smoothing techniques BS and GIS, respectively. This can be seen in the fact that if an n-gram is smoothed with a smoothed (n-1)-gram using SUN BS (respectively, SUN GIS), the result is the same as that of a BS (respectively, GIS) model.
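To make the four arc families concrete, the following sketch assembles the SUN arc set from a toy automaton and bigram. The helper names (histories, inv_histories, ngram_arcs) and all data are hypothetical placeholders for H_{τ,n-1}, H^{-1}_{τ,n} and δ_η, and the probability functions d, b and s are omitted.

    # Hypothetical inputs for smoothing a toy automaton with a bigram (n = 2).
    n = 2
    tau_arcs = {("B", "saw"): "C", ("G", "a"): "H"}                       # arcs of tau
    ngram_arcs = {(("I",), "saw"): ("saw",), (("drove",), "a"): ("a",)}   # arcs of eta
    histories = {"B": {("I",)}, "G": {("drove",)}, "H": {("a",)}}         # H_{tau,n-1}
    inv_histories = {("drove", "a"): {"H"}}                               # H^{-1}_{tau,n}

    def build_sun_arcs():
        """Construct the down, up and stay arcs of SUN(tau, eta, ...)."""
        down = {(q, None, h) for q, hs in histories.items() for h in hs}  # lambda-arcs into eta
        up, stay = set(), set()
        for (h, x), h2 in ngram_arcs.items():
            targets = inv_histories.get(h + (x,), set())
            if targets:                                   # extended history exists in tau: go up
                up |= {(h, x, q) for q in targets}
            else:                                         # otherwise stay inside the n-gram
                stay.add((h, x, h2))
        original = {(q, x, q2) for (q, x), q2 in tau_arcs.items()}
        return original | down | up | stay

    for arc in sorted(build_sun_arcs(), key=str):
        print(arc)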

4.2. Choosing the distributions

The formalization of SUN allows different parameterizations of the functions d, b and s. In the n-gram smoothing literature there are several discounting techniques that can be easily adapted to play the role of d. We have chosen Witten-Bell discounting [6], which is powerful and simple to implement. Given an automaton τ,

    \tilde{\pi}_\tau(q, x, q') = d(q, x, q')\, \pi_\tau(q, x, q')
                               = \frac{N(q)}{N(q) + D(q)} \cdot \frac{N(q, x, q')}{N(q)}
                               = \frac{N(q, x, q')}{N(q) + D(q)},

where N(q, x, q') and N(q) are the number of times this arc/state was used when analyzing the training corpus, and D(q) is the number of different labeled arcs leaving state q (plus one if it is final). The smoothed final-state probability is ψ̃_τ(q) = N_f(q) / (N(q) + D(q)), where N_f(q) is the number of times q was final when analyzing the training corpus.

We feel that the functions b and s should take into account the frequency of the different histories in each state. This information can be easily obtained using a history counting function over a training corpus. As a way of distributing the probabilities in accordance with these counts, we define b(q, λ, h̄) and s(h̄, x, q) as

    b(q, \lambda, \bar{h}) = \frac{E_{\tau,q}(\bar{h})}{\sum_{\bar{h}' \in H_{\tau,n-1}(q)} E_{\tau,q}(\bar{h}')},
    \qquad
    s(\bar{h}, x, q) = \pi_\eta(\bar{h}, x)\, \frac{E_{\tau,q}(\bar{h}x)}{\sum_{q' \in H^{-1}_{\tau,n}(\bar{h}x)} E_{\tau,q'}(\bar{h}x)},

where E_{τ,q}(h̄) is the number of times we arrive at state q of τ with history h̄. It can be easily proven that both b and s satisfy their respective restrictions, (2) and (3).
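A small sketch of how these quantities could be computed from usage counts; the count dictionaries are hypothetical stand-ins for N(q), N(q, x, q'), N_f(q), D(q) and E_{τ,q}(h̄), which in practice would be gathered while parsing the training corpus with τ.

    # Hypothetical usage counts collected while parsing the training corpus with tau.
    state_count = {"B": 20}                                     # N(q)
    arc_count = {("B", "saw", "C"): 12, ("B", "two", "D"): 6}   # N(q, x, q')
    final_count = {"B": 2}                                      # N_f(q)
    history_count = {("B", ("I",)): 15, ("B", ("you",)): 5}     # E_{tau,q}(h)

    def witten_bell_arc_prob(q, x, q2):
        """Discounted arc probability N(q,x,q')/(N(q)+D(q)) (Witten-Bell)."""
        D = len({a for a in arc_count if a[0] == q}) + (1 if final_count.get(q) else 0)
        return arc_count.get((q, x, q2), 0) / (state_count[q] + D)

    def down_arc_weight(q, h):
        """b(q, lambda, h): relative frequency of history h among the histories seen at q."""
        total = sum(c for (s, _), c in history_count.items() if s == q)
        return history_count.get((q, h), 0) / total

    print(witten_bell_arc_prob("B", "saw", "C"))   # 12 / (20 + 3)
    print(down_arc_weight("B", ("I",)))            # 15 / 20 = 0.75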

4.3. Single history automata

A problem for SUN is the existence of more than one history of length n-1 in the states of the automaton. Remember that the n-gram predicts the next word by considering the previous n-1 words. Now, consider a state q and suppose it has three histories h̄_1, h̄_2 and h̄_3 of length n-1. This implies the construction of three down arcs. But, when actually using the automaton, at most one of those histories is the one actually seen, so the other two down arcs lead to states of the n-gram whose histories do not correspond to the path followed. It would be desirable to have only one history per state, so that it is always possible to go down to the state with the correct history.

We can in fact achieve that for general automata. First, define τ to be an n-SH automaton if for every state q the set H_{τ,n}(q) is a singleton. The interesting result is that every automaton can be converted into an equivalent n-SH automaton by a simple construction. Given τ, the corresponding n-SH automaton τ′ is defined by:

    Q_{\tau'} = \{ (q, \bar{h}) \mid \bar{h} \in H_{\tau,n}(q) \},
    \qquad
    q_{0,\tau'} = (q_0, \#^{n}),
    \qquad
    \delta_{\tau'} = \{ ((q, \bar{h}), x, (q', (\bar{h}x)_2)) \mid (q, x, q') \in \delta_\tau,\ \bar{h} \in H_{\tau,n}(q) \}.

With respect to the probabilities, this conversion can be done in two manners. The first one is to translate the probabilities so that the new automaton represents the same distribution. The other one, which we have followed, takes into account that the original distribution was inferred from a training corpus and is therefore only an approximation; we feel that it is better to reestimate the probabilities on the new automaton. As seen in the experiments below, the best approach for smoothing the automaton is to first convert it into an n-SH automaton, then reestimate the probabilities, and after that apply the SUN.

5. Experiments

To test the SUN we use ATIS-2, a subcorpus of ATIS, which is a set of sentences recorded from speakers requesting information from an air travel database. ATIS-2 contains a training set of 13,044 utterances (130,773 tokens), and two evaluation sets of 974 utterances (10,636 tokens) and 1,001 utterances (11,703 tokens), respectively. The vocabulary contains 1,294 words. We use the second evaluation set as test set. The first evaluation set was not used; however, the Error Correcting Parsing (ECP) approach we compare with uses this set to estimate the error model.

Three aspects were considered in the experiments: SUN versus other automata smoothing techniques; SUN GIS versus SUN BS; and automata smoothing versus n-gram smoothing. The experiments consisted in building the smoothed unigram, bigram and trigram for the task (using both GIS and BS smoothing), plus an automaton obtained using the ALERGIA algorithm [1]. This automaton was smoothed using direct smoothing with a unigram, Error Correcting Parsing, and the SUN approach. The results of the first two methods constitute the reference and they were presented by Dupont and Amengual [4]. The SUN approach was tested with the different n-grams, using both the original automaton and the corresponding n-SH automata.

Table 1 shows the results of the experiments. The reference results are those presented by Dupont and Amengual [4]. Their experiments were done with the same corpus and the same ALERGIA automaton, so our results are comparable with theirs. They smoothed the automaton using a unigram and different ECP models (ECP smoothing does not use n-grams). They obtained a perplexity of 71 with the unigram smoothing and of 37 with the best ECP model, nearly a 50% improvement but still far from the perplexity of bigrams or trigrams.

Table 1. Test set perplexity results. Column SUN SH (SUN using single history automata) shows two results: using all paths and using only the best one. (a) GIS; (b) BS; (c) Reference (direct unigram smoothing and ECP).

Table 2. Number of arcs of the automaton and of the n-gram traversed during the parsing of the test set using only the best path. The last column shows the number of down arcs employed. (a) GIS; (b) BS.

Our results improve on theirs. The use of SUN GIS with unigrams slightly improves perplexity. When using bigrams or more, the results are far better (nearly a 30% improvement over ECP for trigrams), but still worse than n-grams. When the automaton is transformed into its equivalent n-SH, the results are even better than for bigrams; for trigrams they are equivalent. When using BS, the conclusions are very similar, except for the case of the automaton not transformed into an n-SH.

In the last column of Table 1 we can see the effect of using only the best path instead of summing over all possible paths. This effect is negligible for SUN BS but important for SUN GIS. The difference is due to the large number of paths accepting a sentence in SUN GIS.

Another important question is whether the parsing uses more arcs from the automaton or from the n-gram. It would be desirable that most of the arcs correspond to the automaton. In Table 2 we can see the numbers for the best paths. Both for SUN BS and SUN GIS, most of the symbols are parsed in the automaton. The ratio decreases as the order of the n-gram increases. This shows that the better the smoothing model, the more it is used. Using SUN BS and unigrams, the percentage of symbols parsed by the automaton is 91%; for bigrams it is 82% and for trigrams 67%. On the other hand, the percentage of words using down arcs is 5% for unigrams, 10% for bigrams and 13% for trigrams. The behavior of SUN GIS is similar.

For comparison purposes, we have repeated the tests of the BS n-gram models using the CMU-Cambridge Toolkit [3] with the same discounting function. As expected, the results are the same as those in the second column of Table 1(b).
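The evaluation metric throughout the experiments is test-set perplexity. The computation is not spelled out in the paper, but the standard definition can be sketched as follows; sentence_probability stands in for any of the parsers sketched above.

    import math

    def perplexity(test_sentences, sentence_probability):
        """Standard test-set perplexity: 2 ** (cross-entropy per word).
        Assumes the smoothed model assigns a nonzero probability to every sentence."""
        log_prob, n_words = 0.0, 0
        for sentence in test_sentences:
            log_prob += math.log2(sentence_probability(sentence))
            n_words += len(sentence) + 1     # +1 if the end-of-sentence event is counted
        return 2.0 ** (-log_prob / n_words)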

6. Conclusions

Smoothed n-grams can be formalized as finite state models. This formalization has been used in two directions: for defining a new smoothing for n-grams, GIS, and for extending backoff smoothing and GIS to general automata.

The extension of n-gram smoothing to general automata has been made using several new concepts also introduced here. In the first place, the definition of the sets of histories of a state is used in order to define the relationship between the states of the automaton and the n-gram used for smoothing. Related to that, we present a transformation of automata (n-SH) that makes them conceptually nearer to the structure of the n-grams, thereby easing the smoothing. Finally, appropriate modifications of the parsing algorithms are presented in order to cope with the new automata.

The experimental results indicate that the methods obtain a suitable smoothing. This translates into better perplexities than those of the models used for smoothing, while keeping the structure of the automaton, which allows for modeling more complex relationships than is possible with n-grams.

References

1. R. C. Carrasco and J. Oncina, Learning deterministic regular grammars from stochastic samples in polynomial time, Theoret. Inform. Appl. 33, 1 (1999).
2. S. F. Chen and J. Goodman, An empirical study of smoothing techniques for language modeling, Comput. Speech Lang. 13, 4 (1999).
3. P. Clarkson and R. Rosenfeld, Statistical language modeling using the CMU-Cambridge Toolkit, EUROSPEECH, 1997.
4. P. Dupont and J.-C. Amengual, Smoothing probabilistic automata: an error-correcting approach, Grammatical Inference: Algorithms and Applications, ed. A. de Oliveira, Lecture Notes in Artificial Intelligence, Vol. 1891, Springer-Verlag, 2000.
5. S. M. Katz, Estimation of probabilities from sparse data for the language model component of a speech recognizer, IEEE Trans. ASSP 34, 3 (1987).
6. P. Placeway, R. Schwartz, P. Fung and L. Nguyen, The estimation of powerful language models from small and large corpora, ICASSP 93, Vol. II, 1993.

David Llorens received the Master and Ph.D. degrees in computer science from the Polytechnic University of València, Spain, in 1995 and 2000, respectively. From 1995 to 1997 he worked with the Department of Information Systems and Computation, Polytechnic University of València, first as a predoctoral fellow and from 1996 as Assistant Professor. Since 1997, he has been with the Department of Software and Computing Systems of the University Jaume I of Castelló, Spain, first as Assistant Professor and from 2001 as Associate Professor. He has been an active member of the Pattern Recognition and Artificial Intelligence Group in the Department of Information Systems and Computation, and now he works with the Group for Computational Learning, Automatic Recognition and Translation of Speech at the Department of Software and Computing Systems of the University Jaume I of Castelló, Spain. His current research lies in the areas of finite state language modeling, language model smoothing, speech recognition, machine translation and syntactic pattern recognition.

Juan Miguel Vilar received the B.Sc. in computer studies in 1990 from the Liverpool Polytechnic (now John Moores University) in England, and the M.Sc. and Ph.D. degrees in computer science in 1993 and 1998 from the Polytechnic University of Valencia in Spain. He is currently working as an Associate Professor with the Department of Software and Computing Systems of the University Jaume I of Castelló, Spain. He has been an active member of the Pattern Recognition and Artificial Intelligence Group in the Department of Information Systems and Computation and now works with the Group for Computational Learning, Automatic Recognition and Translation of Speech at the Department of Software and Computing Systems of the University Jaume I of Castelló, Spain. His current research is in finite state and statistical models for machine translation, computational learning, language modeling, and word clustering.

Francisco Casacuberta received the Master and Ph.D. degrees in physics from the University of València, Spain, in 1976 and 1981, respectively. From 1976 to 1979, he worked with the Department of Electricity and Electronics at the University of València as an FPI fellow. From 1980 to 1986, he was with the Computing Center of the University of València. Since 1980, he has been with the Department of Information Systems and Computation of the Polytechnic University of València, first as an Associate Professor and from 1990 as a Full Professor. Since 1981, he has been an active member of a research group in the fields of automatic speech recognition and machine translation. Dr. Casacuberta is a member of the Spanish Society for Pattern Recognition and Image Analysis (AERFAI), which is an affiliate society of IAPR, the IEEE Computer Society and the Spanish Association for Artificial Intelligence (AEPIA). His current research interest lies in the areas of speech recognition, machine translation, syntactic pattern recognition, statistical pattern recognition and machine learning.


More information

DESCRIPTIONAL COMPLEXITY OF NFA OF DIFFERENT AMBIGUITY

DESCRIPTIONAL COMPLEXITY OF NFA OF DIFFERENT AMBIGUITY International Journal of Foundations of Computer Science Vol. 16, No. 5 (2005) 975 984 c World Scientific Publishing Company DESCRIPTIONAL COMPLEXITY OF NFA OF DIFFERENT AMBIGUITY HING LEUNG Department

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Language Models, Graphical Models Sameer Maskey Week 13, April 13, 2010 Some slides provided by Stanley Chen and from Bishop Book Resources 1 Announcements Final Project Due,

More information

Theory of Computation

Theory of Computation Thomas Zeugmann Hokkaido University Laboratory for Algorithmics http://www-alg.ist.hokudai.ac.jp/ thomas/toc/ Lecture 3: Finite State Automata Motivation In the previous lecture we learned how to formalize

More information

Advanced Automata Theory 7 Automatic Functions

Advanced Automata Theory 7 Automatic Functions Advanced Automata Theory 7 Automatic Functions Frank Stephan Department of Computer Science Department of Mathematics National University of Singapore fstephan@comp.nus.edu.sg Advanced Automata Theory

More information

Week 13: Language Modeling II Smoothing in Language Modeling. Irina Sergienya

Week 13: Language Modeling II Smoothing in Language Modeling. Irina Sergienya Week 13: Language Modeling II Smoothing in Language Modeling Irina Sergienya 07.07.2015 Couple of words first... There are much more smoothing techniques, [e.g. Katz back-off, Jelinek-Mercer,...] and techniques

More information

Automata Theory for Presburger Arithmetic Logic

Automata Theory for Presburger Arithmetic Logic Automata Theory for Presburger Arithmetic Logic References from Introduction to Automata Theory, Languages & Computation and Constraints in Computational Logic Theory & Application Presented by Masood

More information

Expectation Maximization (EM)

Expectation Maximization (EM) Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted

More information

CS 154, Lecture 2: Finite Automata, Closure Properties Nondeterminism,

CS 154, Lecture 2: Finite Automata, Closure Properties Nondeterminism, CS 54, Lecture 2: Finite Automata, Closure Properties Nondeterminism, Why so Many Models? Streaming Algorithms 0 42 Deterministic Finite Automata Anatomy of Deterministic Finite Automata transition: for

More information

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers

More information

On Homogeneous Segments

On Homogeneous Segments On Homogeneous Segments Robert Batůšek, Ivan Kopeček, and Antonín Kučera Faculty of Informatics, Masaryk University Botanicka 68a, 602 00 Brno Czech Republic {xbatusek,kopecek,tony}@fi.muni.cz Abstract.

More information

Advanced Automata Theory 10 Transducers and Rational Relations

Advanced Automata Theory 10 Transducers and Rational Relations Advanced Automata Theory 10 Transducers and Rational Relations Frank Stephan Department of Computer Science Department of Mathematics National University of Singapore fstephan@comp.nus.edu.sg Advanced

More information

ACS Introduction to NLP Lecture 3: Language Modelling and Smoothing

ACS Introduction to NLP Lecture 3: Language Modelling and Smoothing ACS Introduction to NLP Lecture 3: Language Modelling and Smoothing Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk Language Modelling 2 A language model is a probability

More information

Finite-State Machines (Automata) lecture 12

Finite-State Machines (Automata) lecture 12 Finite-State Machines (Automata) lecture 12 cl a simple form of computation used widely one way to find patterns 1 A current D B A B C D B C D A C next 2 Application Fields Industry real-time control,

More information

Closure under the Regular Operations

Closure under the Regular Operations September 7, 2013 Application of NFA Now we use the NFA to show that collection of regular languages is closed under regular operations union, concatenation, and star Earlier we have shown this closure

More information

Using Multiplicity Automata to Identify Transducer Relations from Membership and Equivalence Queries

Using Multiplicity Automata to Identify Transducer Relations from Membership and Equivalence Queries Using Multiplicity Automata to Identify Transducer Relations from Membership and Equivalence Queries Jose Oncina Dept. Lenguajes y Sistemas Informáticos - Universidad de Alicante oncina@dlsi.ua.es September

More information

Chapter 3: Basics of Language Modeling

Chapter 3: Basics of Language Modeling Chapter 3: Basics of Language Modeling Section 3.1. Language Modeling in Automatic Speech Recognition (ASR) All graphs in this section are from the book by Schukat-Talamazzini unless indicated otherwise

More information

Finite-State Transducers

Finite-State Transducers Finite-State Transducers - Seminar on Natural Language Processing - Michael Pradel July 6, 2007 Finite-state transducers play an important role in natural language processing. They provide a model for

More information

Lecture 2: Connecting the Three Models

Lecture 2: Connecting the Three Models IAS/PCMI Summer Session 2000 Clay Mathematics Undergraduate Program Advanced Course on Computational Complexity Lecture 2: Connecting the Three Models David Mix Barrington and Alexis Maciel July 18, 2000

More information