Formal Modeling in Cognitive Science
Lecture 9: The Noisy Channel and Channel Capacity; Kullback-Leibler Divergence; Cross-Entropy

Frank Keller
School of Informatics, University of Edinburgh
keller@inf.ed.ac.uk
3 March 2006

Properties of the Noisy Channel

So far, we have looked at encoding a message efficiently, but what about
transmitting the message? The transmission of a message can be modeled
using a noisy channel: a message W is encoded, resulting in a string X;
X is transmitted through a channel with the probability distribution
f(y|x); the resulting string Y is decoded, yielding an estimate of the
message W^.

    W --> Encoder --> X --> Channel f(y|x) --> Y --> Decoder --> W^
    Message                                          Estimate of message

We are interested in the mathematical properties of the channel used to
transmit the message, and in particular in its capacity.

Definition: Discrete Channel
A discrete channel consists of an input alphabet X, an output alphabet Y,
and a probability distribution f(y|x) that expresses the probability of
observing symbol y given that symbol x is sent.

Definition: Channel Capacity
The channel capacity of a discrete channel is:

    C = max_{f(x)} I(X; Y)

The capacity of a channel is the maximum of the mutual information of X
and Y over all input distributions f(x).
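These definitions can be made concrete with a short numerical sketch (Python with NumPy is assumed here; the function name `mutual_information` is illustrative, not part of any library): given an input distribution f(x) and a channel matrix f(y|x), the mutual information I(X; Y) can be computed directly from the joint distribution.

```python
import numpy as np

def mutual_information(p_x, channel):
    """I(X;Y) in bits, for input distribution p_x and a channel matrix
    with channel[i, j] = f(y=j | x=i)."""
    p_xy = p_x[:, None] * channel          # joint distribution f(x, y)
    p_y = p_xy.sum(axis=0)                 # output marginal f(y)
    indep = p_x[:, None] * p_y[None, :]    # product distribution f(x) f(y)
    mask = p_xy > 0                        # 0 log 0 = 0 by convention
    return float((p_xy[mask] * np.log2(p_xy[mask] / indep[mask])).sum())

# Noiseless binary channel: the output reproduces the input exactly.
noiseless = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
print(mutual_information(np.array([0.5, 0.5]), noiseless))  # 1.0 bit
```

The capacity is then the maximum of this quantity over all input distributions f(x); for the noiseless binary channel the maximum of 1 bit is attained at the uniform input distribution.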
Example: Noiseless Binary Channel
Assume a binary channel whose input is reproduced exactly at the output.
Each transmitted bit is received without error. The channel capacity of
this channel is:

    C = max_{f(x)} I(X; Y) = 1 bit

This maximum is achieved with f(0) = 1/2 and f(1) = 1/2.

Example: Binary Symmetric Channel
Assume a binary channel whose input is flipped (0 transmitted as 1, or
1 transmitted as 0) with probability p. The mutual information of this
channel is bounded by:

    I(X; Y) = H(Y) - H(Y|X)
            = H(Y) - Σ_x f(x) H(Y|X = x)
            = H(Y) - Σ_x f(x) H(p)
            = H(Y) - H(p)
            ≤ 1 - H(p)

The channel capacity is therefore:

    C = max_{f(x)} I(X; Y) = 1 - H(p) bits

[Figure: a binary data sequence transmitted over a binary symmetric
channel with p = 0.1; roughly one bit in ten is flipped.]

Theorem: Properties of Channel Capacity
1. C ≥ 0, since I(X; Y) ≥ 0;
2. C ≤ log |X|, since C = max I(X; Y) ≤ max H(X) = log |X|;
3. C ≤ log |Y|, for the same reason.
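The closed form C = 1 − H(p) can be checked against the definition of capacity by brute-force search over input distributions (a Python sketch; the grid search is purely illustrative, since for this channel the maximum can be derived analytically):

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits; H(0) = H(1) = 0 by convention."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_capacity_search(p, grid=100001):
    """Capacity of a binary symmetric channel with crossover probability p,
    by maximising I(X;Y) = H(Y) - H(p) over input distributions f(1) = q."""
    best = 0.0
    for q in np.linspace(0.0, 1.0, grid):
        p_y1 = q * (1 - p) + (1 - q) * p   # probability that Y = 1
        best = max(best, binary_entropy(p_y1) - binary_entropy(p))
    return best

p = 0.1
print(bsc_capacity_search(p))   # numerical maximum, attained at q = 1/2
print(1 - binary_entropy(p))    # closed form: 1 - H(0.1) ≈ 0.531
```

Both values agree: with p = 0.1, the best achievable rate is about 0.531 bits per channel use, and the maximum is reached at the uniform input distribution.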
Applications of the Noisy Channel

The noisy channel model can be applied to decoding processes involving
linguistic information. A typical formulation of such a problem is: we
start with a linguistic input I; I is transmitted through a noisy
channel with the probability distribution f(o|i); the resulting output
O is decoded, yielding an estimate of the input I^.

    I --> Noisy Channel f(o|i) --> O --> Decoder --> I^
    Input                                Estimate of input

    Application                     Input           Output              f(i)                   f(o|i)
    Machine translation             target text     source text         prob. of target text   translation model
    Optical character recognition   actual text     text with mistakes  prob. of text          model of OCR errors
    Part of speech tagging          POS sequence    words               probability of POS     f(w|t)
    Speech recognition              word sequence   speech signal       prob. of word seq.     acoustic model

Example: Machine Translation
Let's look at machine translation in more detail. Assume that the
French text (F) passed through a noisy channel and came out as English
(E). We decode it to estimate the original French (F^):

    F --> Noisy Channel f(e|f) --> E --> Decoder --> F^

We compute F^ using Bayes' theorem:

    F^ = argmax_f f(f|e) = argmax_f [f(f) f(e|f)] / f(e)
       = argmax_f f(f) f(e|f)

Here f(e|f) is the translation model, f(f) is the French language model,
and f(e) is the probability of the English text (constant, so it can be
dropped from the argmax).

Example Output: Spanish-English
we all know very well that the current treaties are insufficient and
that, in the future, it will be necessary to develop a better structure
and different for the european union, a structure more constitutional
also make it clear what the competences of the member states and which
belong to the union. messages of concern in the first place just before
the economic and social problems for the present situation, and in
spite of sustained growth, as a result of years of effort on the part
of our citizens. the current situation, unsustainable above all for
many self-employed drivers and in the area of agriculture, we must
improve without doubt.
in itself, it is good to reach an agreement on procedures, but we have
to ensure that this system is not likely to be used as a weapon policy.
now they are also clear rights to be respected. i agree with the signal
warning against the return, which some are tempted to the
intergovernmental methods. there are many of us that we want a
federation of nation states.
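The Bayesian decoding rule above can be sketched as a toy program. The candidate French sentences and the probabilities assigned to them below are invented purely for illustration; a real MT system would score a huge hypothesis space with learned language and translation models.

```python
# Toy noisy-channel decoder: F^ = argmax_f f(f) * f(e|f).
# All candidates and probabilities are made up for this illustration.
candidates = {
    # f : (language-model prob f(f), translation-model prob f(e|f))
    "il fait beau":      (0.020, 0.30),
    "il fait du soleil": (0.005, 0.40),
    "c'est beau":        (0.030, 0.05),
}

def decode(candidates):
    """Pick the source sentence maximising f(f) * f(e|f); the evidence
    f(e) is constant and can be dropped from the argmax."""
    return max(candidates, key=lambda f: candidates[f][0] * candidates[f][1])

print(decode(candidates))  # "il fait beau": 0.020 * 0.30 beats the others
```

Note how the two models trade off: "il fait du soleil" has the best translation-model score, but the language-model prior tips the argmax towards "il fait beau".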
Example Output: Finnish-English
the rapporteurs have drawn attention to the quality of the debate and
also the need to go further: of course, i can only agree with them. we
know very well that the current treaties are not enough and that in
future, it is necessary to develop a better structure for the union
and, therefore perustuslaillisempi structure, which also expressed more
clearly what the member states and the union is concerned. first of
all, kohtaamiemme economic and social difficulties, there is concern,
even if growth is sustainable and the result of the efforts of all, on
the part of our citizens. the current situation, which is unacceptable,
in particular, for many carriers and responsible for agriculture, is in
any case, to be improved. agreement on procedures in itself is a good
thing, but there is a need to ensure that the system cannot be used as
a political lyomaaseena. they also have a clear picture of the rights
of now, in which they have to work. i agree with him when he warned of
the consenting to return to intergovernmental methods. many of us want
of a federal state of the national member states.

Kullback-Leibler Divergence

Definition: Kullback-Leibler Divergence
For two probability distributions f(x) and g(x) for a random variable
X, the Kullback-Leibler divergence or relative entropy is given as:

    D(f || g) = Σ_{x ∈ X} f(x) log [f(x) / g(x)]

The KL divergence compares the entropy of two distributions over the
same random variable. Intuitively, the KL divergence is the average
number of additional bits required when encoding a random variable with
a distribution f(x) using the alternative distribution g(x).

Theorem: Properties of the Kullback-Leibler Divergence
1. D(f || g) ≥ 0;
2. D(f || g) = 0 iff f(x) = g(x) for all x ∈ X;
3. D(f || g) ≠ D(g || f) in general;
4. I(X; Y) = D(f(x, y) || f(x) f(y)).

So the mutual information is the KL divergence between f(x, y) and
f(x) f(y). It measures how far a distribution is from independence.
Example
For a random variable X = {0, 1} assume two distributions f(x) and g(x)
with f(0) = 1 - r, f(1) = r and g(0) = 1 - s, g(1) = s:

    D(f || g) = (1 - r) log [(1 - r)/(1 - s)] + r log (r/s)
    D(g || f) = (1 - s) log [(1 - s)/(1 - r)] + s log (s/r)

If r = s then D(f || g) = D(g || f) = 0. If r = 1/2 and s = 1/4:

    D(f || g) = 1/2 log [(1/2)/(3/4)] + 1/2 log [(1/2)/(1/4)] = 0.2075
    D(g || f) = 3/4 log [(3/4)/(1/2)] + 1/4 log [(1/4)/(1/2)] = 0.1887
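The worked example can be verified numerically (a minimal sketch; the helper name `kl` is an assumption, not a library function):

```python
import numpy as np

def kl(f, g):
    """D(f || g) = sum_x f(x) log2 [f(x)/g(x)], in bits.
    Terms with f(x) = 0 contribute 0 by convention."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    mask = f > 0
    return float((f[mask] * np.log2(f[mask] / g[mask])).sum())

r, s = 0.5, 0.25
f = [1 - r, r]
g = [1 - s, s]
print(round(kl(f, g), 4))  # 0.2075
print(round(kl(g, f), 4))  # 0.1887 -- the KL divergence is not symmetric
```

The two directions disagree, confirming property 3 of the theorem above: D(f || g) is not a metric.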
Cross-Entropy

Definition: Cross-Entropy
For a random variable X with the probability distribution f(x), the
cross-entropy for the probability distribution g(x) is given as:

    H(X, g) = - Σ_{x ∈ X} f(x) log g(x)

The cross-entropy can also be expressed in terms of entropy and KL
divergence:

    H(X, g) = H(X) + D(f || g)

Intuitively, the cross-entropy is the total number of bits required
when encoding a random variable with a distribution f(x) using the
alternative distribution g(x).

Example
In the last lecture, we constructed a code for the following
distribution using Huffman coding:

    x            a      e      i      o      u
    f(x)         0.12   0.42   0.09   0.30   0.07
    -log f(x)    3.06   1.25   3.47   1.74   3.84

The entropy of this distribution is H(X) = 1.995. Now compute the
distribution g(x) = 2^{-l(x)} associated with the Huffman code, where
l(x) is the length of the codeword for x:

    x       a      e      i      o      u
    l(x)    3      1      4      2      4
    g(x)    1/8    1/2    1/16   1/4    1/16

Then the cross-entropy for g(x) is:

    H(X, g) = - Σ_{x ∈ X} f(x) log g(x)
            = -(0.12 log 1/8 + 0.42 log 1/2 + 0.09 log 1/16
                + 0.30 log 1/4 + 0.07 log 1/16)
            = 2.02

The KL divergence is:

    D(f || g) = H(X, g) - H(X) = 2.02 - 1.995 = 0.025

This means we are losing on average 0.025 bits by using the Huffman
code rather than the theoretically optimal code given by the Shannon
information.

Summary

The noisy channel models the errors and loss when transmitting a
message with input X and output Y; the capacity of the channel is given
by the maximum of the mutual information of X and Y; a binary symmetric
channel is one where each bit is flipped with probability p; the noisy
channel can be applied to linguistic problems, e.g., machine
translation; the Kullback-Leibler divergence is the distance between
two distributions (the cost of encoding f(x) through g(x)).
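Finally, the Huffman cross-entropy example can be recomputed in a few lines (a Python sketch assuming NumPy; variable names are illustrative). It also checks the identity H(X, g) = H(X) + D(f || g):

```python
import numpy as np

# Distribution from the Huffman example and the code-implied distribution
# g(x) = 2^(-l(x)) for code lengths l = (3, 1, 4, 2, 4) on (a, e, i, o, u).
f = np.array([0.12, 0.42, 0.09, 0.30, 0.07])
l = np.array([3, 1, 4, 2, 4])
g = 2.0 ** (-l)

H = -(f * np.log2(f)).sum()          # entropy H(X)
H_cross = -(f * np.log2(g)).sum()    # cross-entropy H(X, g); equals f . l
D = (f * np.log2(f / g)).sum()       # KL divergence D(f || g)

print(round(H, 3), round(H_cross, 3), round(D, 3))  # 1.995 2.02 0.025
```

Note that the cross-entropy under a prefix code with lengths l(x) is simply the expected code length Σ f(x) l(x), which is why the Huffman code's average length exceeds the entropy by exactly D(f || g).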