
arXiv:0809.1043v1 [cs.IT] 5 Sep 2008

On Unique Decodability

Marco Dalai, Riccardo Leonardi

Abstract: In this paper we propose a revisitation of the topic of unique decodability and of some fundamental theorems of lossless coding. It is widely believed that, for any discrete source $X$, every uniquely decodable block code satisfies $E[l(X_1 X_2 \cdots X_n)] \geq H(X_1, X_2, \ldots, X_n)$, where $X_1, X_2, \ldots, X_n$ are the first $n$ symbols of the source, $E[l(X_1 X_2 \cdots X_n)]$ is the expected length of the code for those symbols and $H(X_1, X_2, \ldots, X_n)$ is their joint entropy. We show that, for certain sources with memory, the above inequality only holds when a limiting definition of uniquely decodable code is considered. In particular, the above inequality is usually assumed to hold for any practical code due to a debatable application of McMillan's theorem to sources with memory. We thus propose a clarification of the topic, also providing an extended version of McMillan's theorem to be used for Markovian sources.

Index Terms: Lossless source coding, McMillan's theorem, constrained sources, minimum expected code length.

I. INTRODUCTION

The problem of lossless encoding of information sources has been intensively studied over the years (see [1, Sec. II] for a detailed historical overview of the key developments in this field). Shannon initiated the mathematical formulation of the problem in his major work [2] and provided the first results on the average number of bits per source symbol that must be used asymptotically in order to represent an information source. For a random variable $X$ with alphabet $\mathcal{X}$ and probability mass function $p_X(\cdot)$, he defined the entropy of $X$ as the quantity

$$H(X) = \sum_{x \in \mathcal{X}} p_X(x) \log \frac{1}{p_X(x)}.$$

On the other hand, Shannon focused his attention on finite state Markov sources $X = \{X_1, X_2, \ldots\}$, for which he defined the entropy as

$$H(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n),$$

a quantity that is now usually called the entropy rate of the source.
Based on these definitions, he derived the fundamental results for fixed length and variable length codes. In particular, he showed that, by encoding sufficiently large blocks of symbols, the average number of bits per symbol used by fixed length codes can be made as close as desired to the entropy rate of the source while keeping the probability of error as small as desired. If variable length codes are allowed, furthermore, he showed that the probability of error can be reduced to zero without increasing the asymptotically achievable average rate. Shannon also proved the converse theorem for the case of fixed length codes, but he did not explicitly consider the converse theorem for variable length codes (see [1, Sec. II.C]).

The authors are with the Department of Electronics for Automation, University of Brescia, via Branze 38 - 25123, Brescia, Italy. Email: {marco.dalai, riccardo.leonardi}@ing.unibs.it

An important contribution in this direction came from McMillan [3], who showed that every uniquely decodable code using a $D$-ary alphabet must satisfy Kraft's inequality, $\sum_i D^{-l_i} \leq 1$, $l_i$ being the codeword lengths [4]. Based on this result, he was able to prove that the expected length of a uniquely decodable code for a random variable $X$ is not smaller than its entropy, $E[l(X)] \geq H(X)$. This represents a strong converse result in coding theory. However, while the initial work by Shannon explicitly referred to finite state Markov sources, McMillan's results basically considered only the encoding of a random variable. This leads to immediate conclusions on the problem of encoding memoryless sources, but an ad hoc study is necessary for the case of sources with memory. The application of McMillan's theorem to this type of sources can be found in [5, Sec. 5.4] and [6, Sec. 3.5].
In these two well-known references, McMillan's result is used not only to derive a converse theorem on the asymptotic average number of bits per symbol needed to represent an information source, but also to deduce a non-asymptotic strong converse to the coding theorem. In particular, the famous result obtained (see [6, Th. 3.5.2], [5, Th. 5.4.2], [7, Sec. II, p. 2047]) is that, for every source with memory, any uniquely decodable code satisfies

$$E[l(X_1 X_2 \cdots X_n)] \geq H(X_1, X_2, \ldots, X_n), \qquad (1)$$

where $X_1, X_2, \ldots, X_n$ are the first $n$ symbols of the source, $E[l(X_1 X_2 \cdots X_n)]$ is the expected length of the code for those symbols and $H(X_1, X_2, \ldots, X_n)$ represents their joint entropy.

In this paper we want to clarify that inequality (1) is only valid if a limiting definition of uniquely decodable code is assumed. In particular, we show that there are information sources for which a reversible encoding operation exists that produces a code for which inequality (1) fails for every $n$. This is demonstrated through a simple example in Section II. In Section III we revisit the topic of unique decodability, consequently providing an extension of McMillan's theorem for the case of first order Markov sources. Finally, in Section IV, some additional remarks on the considered topic are made.

II. A MEANINGFUL EXAMPLE

Let $X = \{X_1, X_2, \ldots\}$ be a first order Markov source with alphabet $\mathcal{X} = \{A, B, C, D\}$ and with transition probabilities shown by the graph of Fig. 1. Its transition probability matrix is thus

$$P = \begin{bmatrix} 1/2 & 0 & 1/2 & 0 \\ 0 & 1/2 & 0 & 1/2 \\ 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/4 & 1/4 \end{bmatrix},$$

where rows and columns are associated to the natural alphabetical order of the symbol values A, B, C and D. It is not difficult to verify that the stationary distribution associated with this transition probability matrix is the uniform distribution. Let $X_1$ be uniformly distributed, so that the source $X$ is stationary and, in addition, ergodic. Let us now examine possible binary encoding techniques for this source and possibly find an optimal one.
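The stationarity and ergodicity claims can be checked mechanically. The following is a minimal sketch, not part of the original paper: the matrix `P` is an assumption, inferred from the description of the example (after A or B there are two equiprobable successors, after C or D four, and the transitions A to B, A to D, B to A and B to C are impossible), and the helper names `step`, `mat_mul` and `is_primitive` are hypothetical.

```python
from fractions import Fraction as F

# Assumed transition matrix (rows: current symbol, columns: next symbol),
# in the order A, B, C, D. Inferred from the text, not copied from it.
P = [
    [F(1, 2), F(0), F(1, 2), F(0)],   # from A: to A or C
    [F(0), F(1, 2), F(0), F(1, 2)],   # from B: to B or D
    [F(1, 4)] * 4,                    # from C: to any symbol
    [F(1, 4)] * 4,                    # from D: to any symbol
]

def step(dist, p):
    """One step of the chain: the row vector dist times p."""
    n = len(p)
    return [sum(dist[i] * p[i][j] for i in range(n)) for j in range(n)]

def mat_mul(a, b):
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_primitive(p, max_power=16):
    """True if some power of p (up to max_power) has only positive entries."""
    power = p
    for _ in range(max_power):
        if all(x > 0 for row in power for x in row):
            return True
        power = mat_mul(power, p)
    return False

uniform = [F(1, 4)] * 4
rows_ok = all(sum(row) == 1 for row in P)          # each row is a distribution
uniform_is_stationary = step(uniform, P) == uniform
chain_is_primitive = is_primitive(P)               # irreducible and aperiodic

print(rows_ok, uniform_is_stationary, chain_is_primitive)
```

Exact fractions are used so that the stationarity test is an equality rather than a floating-point comparison. A primitive transition matrix corresponds to an irreducible, aperiodic chain, which is what makes the stationary source ergodic.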
Fig. 1. Graph, with transition probabilities, for the Markov source used in the example.

In order to evaluate the performance of different codes we determine the entropy of the sequences of symbols that can be produced by this source. By the chain rule, the Markov property and the stationarity of the source, one easily proves that

$$H(X_1, X_2, \ldots, X_n) = H(X_1) + \sum_{i=2}^{n} H(X_i | X_{i-1}) = 2 + \frac{3}{2}(n-1),$$

where $H(X_i | X_{i-1})$ is the conditional entropy of $X_i$ given $X_{i-1}$, that is

$$H(X_i | X_{i-1}) = \sum_{x, y \in \mathcal{X}} p_{X_i X_{i-1}}(x, y) \log \frac{1}{p_{X_i | X_{i-1}}(x | y)}.$$

Let us now consider the following binary codes to represent sequences produced by this source.

Classic code. We call this first code classic as it is the most natural way to encode the source given its particular structure. Since the first symbol is uniformly distributed between four choices, 2 bits are used to uniquely identify it, in an obvious way. For the next symbols we note that we always have dyadic conditional probabilities. So, we apply a state-dependent code. For encoding the $k$-th symbol we use, again in an obvious way, 1 bit if symbol $k-1$ was an A or a B, and we use 2 bits if symbol $k-1$ was a C or a D. This code seems to fit the source perfectly, as the number of bits used always corresponds to the uncertainty. Indeed, the average length of the code for the first $n$ symbols is given by

$$E[l(X_1, X_2, \ldots, X_n)] = E[l(X_1)] + \sum_{i=2}^{n} E[l(X_i)] = 2 + \frac{3}{2}(n-1).$$

So, the expected number of bits used for the first $n$ symbols is exactly the same as their entropy, which would let us declare that this encoding technique is optimal.

Alternative code. Let us consider a different code, obtained by applying the following fixed mapping from symbols to bits: A → 0, B → 1, C → 01, D → 10. It is easy to see that this code maps different sequences of symbols into the same codeword. For example, the sequences AB and C are both coded to 01.
This is usually expressed, see for example [5], by saying that the code is not uniquely decodable, an expression which suggests the idea that the code cannot be inverted, different sequences being associated to the same code. It is however easy to notice that, for the source considered in this example, the code does not introduce any ambiguity. Different sequences that are producible by the source are in fact mapped into different codes. Thus it is possible to decode any sequence of bits without ambiguity. For example, the code 01 can only be produced by the single symbol C and not by the sequence AB, since our source cannot produce such a sequence (the transition from A to B being impossible).

It is not difficult to verify that it is indeed possible to decode any sequence of bits by operating in the following way. Consider first the case when there are still two or more bits to decode. In such a case, for the first pair of encountered bits, if a 00 (respectively a 11) is observed then clearly this corresponds to an A symbol followed by a code starting with a 0 (respectively a B symbol followed by a code starting with a 1). If, instead, a 01 pair is observed (respectively a 10) then a C must be decoded (respectively a D). Finally, if there is only one bit left to decode, say a 0 or a 1, the decoded symbol is respectively an A or a B. Such coding and decoding operations are summarized in Table I.

Now, what is the performance of this code? The expected number of bits in coding the first $n$ symbols is given by

$$E[l(X_1 X_2 X_3 \cdots X_n)] = \sum_{i=1}^{n} E[l(X_i)] = \frac{3}{2} n.$$

Unexpectedly, the average number of bits used by the code is strictly smaller than the entropy of the symbols. So, the performance of this code is better than that of what would traditionally have been considered the optimal code, that is, the classic code.
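The decoding rule and the two closed-form expressions can be checked by exhaustive enumeration for small $n$. The sketch below is illustrative only: the transition table `TRANS` is an assumption inferred from the description of the example, and the function names (`encode`, `decode`, `source_sequences`, `summary`) are hypothetical.

```python
from fractions import Fraction as F
from itertools import product
from math import log2

CODE = {"A": "0", "B": "1", "C": "01", "D": "10"}

# Assumed transition probabilities of the example source (inferred from the
# text): A -> {A, C}, B -> {B, D}, C and D -> any symbol, all equiprobable.
TRANS = {
    "A": {"A": F(1, 2), "C": F(1, 2)},
    "B": {"B": F(1, 2), "D": F(1, 2)},
    "C": {s: F(1, 4) for s in "ABCD"},
    "D": {s: F(1, 4) for s in "ABCD"},
}

def encode(seq):
    """Concatenate the fixed codewords of the alternative code."""
    return "".join(CODE[s] for s in seq)

def decode(bits):
    """Decoding rule described above: inspect a pair of bits when available."""
    out, i = [], 0
    while i < len(bits):
        pair = bits[i:i + 2]
        if pair == "01":
            out.append("C")
            i += 2
        elif pair == "10":
            out.append("D")
            i += 2
        else:
            # "00", "11", or a single trailing bit: emit A or B, consume one bit.
            out.append("A" if bits[i] == "0" else "B")
            i += 1
    return "".join(out)

def source_sequences(n):
    """Yield (sequence, probability) for every length-n sequence the source can emit."""
    for seq in product("ABCD", repeat=n):
        p = F(1, 4)  # X_1 is uniform
        for a, b in zip(seq, seq[1:]):
            p *= TRANS[a].get(b, 0)
        if p > 0:
            yield "".join(seq), p

def summary(n):
    """Return (expected code length, joint entropy in bits) for the first n symbols."""
    expected_length, entropy = F(0), 0.0
    for seq, p in source_sequences(n):
        bits = encode(seq)
        assert decode(bits) == seq  # every producible sequence round-trips
        expected_length += p * len(bits)
        entropy -= float(p) * log2(float(p))
    return expected_length, entropy

for n in range(1, 6):
    print(n, summary(n))
```

For $n = 3$, for instance, the enumeration gives an expected length of 9/2 bits against a joint entropy of 5 bits, in agreement with $\frac{3}{2}n$ and $2 + \frac{3}{2}(n-1)$.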
Let us mention that this code is not only more efficient on average, but it is at least as efficient as the classic code for every possible sequence which remains compliant with the source characteristics. For each source sequence, indeed, the number of decoded symbols after reading the first $m$ bits of the alternative code is always larger than or equal to the number of symbols decoded with the first $m$ bits of the classic code. Hence, the proposed alternative code is more efficient than the classic code in all respects.

TABLE I. Encoding and decoding operations of the proposed alternative code for the Markov source of Figure 1.

    Encoding                 A → 0      B → 1      C → 01     D → 10

    Decoding                 00... → A + 0...
    (more bits left)         01... → C ...
                             10... → D ...
                             11... → B + 1...

    Decoding                 0 → A
    (one bit left)           1 → B

The obtained gain per symbol obviously goes to zero asymptotically, as imposed by the Asymptotic Equipartition Property. However, in practical cases we are usually interested in coding a finite number of symbols. Thus, this simple example reveals that the problem of finding an optimal code is not yet well understood for the case of sources with memory. The obtained results may thus have interesting consequences not only from a theoretical point of view, but even for practical purposes in the case of sources exhibiting constraints imposing high order dependencies.

Commenting on the alternative code, one may object that it is not fair to use the knowledge of impossible transitions in order to design the code. But probably nobody would object to the design of what we called the classic code. Even in that case, however, the knowledge that some transitions are impossible was used, in order to construct a state-dependent optimal code.

It is important to point out that we have just shown a fixed-to-variable length code for a stationary ergodic source that maps sequences of $n$ symbols into strings of bits that can be decoded and such that the average code length is smaller than the entropy of those $n$ symbols. Furthermore, this holds for every $n$, and not for an a priori fixed $n$. In a sense we could say that the given code has a negative redundancy.

Note that there is a huge difference between the considered setting and that of the so-called one-to-one codes (see for example [8] for details). In the case of one-to-one codes, it is assumed that only one symbol, or a given known number of symbols, must be coded, and codes are studied as maps from symbols to binary strings without considering the decodability of concatenations of codewords.
Under those hypotheses, Wyner [9] first pointed out that the average codeword length can always be made lower than the entropy, and different authors have studied bounds on the expected code length over the years [10], [11]. Here, instead, we have considered a fixed-to-variable length code used to compress sequences of symbols of whatever length, concatenating the code for the symbols one by one, as in the most classic scenario.

III. UNIQUE DECODABILITY FOR CONSTRAINED SOURCES

In this section we briefly survey the literature on unique decodability and we then propose an adequate treatment of the particular case of constrained sources defined as follows.

Definition 1: A source $X = \{X_1, X_2, \ldots\}$ with symbols in a discrete alphabet $\mathcal{X}$ is a constrained source if there exists a finite sequence of symbols from $\mathcal{X}$ that cannot be obtained as output of the source $X$.

A. Classic definitions and revisitation

It is interesting to consider how the topic of unique decodability has been historically dealt with in the literature and how the results on unique decodability are used to deduce results on the expected length of codes. Taking [6] and [5] as representative references for what can be viewed as the classic approach to lossless source coding, we note some common structures between them in the development of the theory, but also some interesting differences. The most important fact to be noticed is the use, in both references with only marginal differences, of the following chain of deductions:

(a) McMillan's theorem asserts that all uniquely decodable codes satisfy Kraft's inequality;
(b) If a code for a random variable $X$ satisfies Kraft's inequality, then $E[l(X)] \geq H(X)$;
(c) Thus any uniquely decodable code for a random variable $X$ satisfies $E[l(X)] \geq H(X)$;
(d) For sources with memory, by considering sequences of $n$ symbols as super-symbols, we deduce that any uniquely decodable code satisfies $E[l(X_1, X_2, \ldots, X_n)] \geq H(X_1, X_2, \ldots, X_n)$.
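Step (b) is the standard relative-entropy argument. It is sketched here for reference and is not taken from the text above: logarithms are assumed to be in base $D$ (entropy measured in $D$-ary units), $K$ denotes the Kraft sum, and $q$ is an auxiliary distribution introduced only for this derivation.

```latex
\begin{align*}
K &= \sum_{x \in \mathcal{X}} D^{-l(x)} \le 1,
  \qquad q(x) = \frac{D^{-l(x)}}{K}, \\
E[l(X)] - H(X)
  &= \sum_{x} p_X(x)\, l(x) + \sum_{x} p_X(x) \log_D p_X(x) \\
  &= \sum_{x} p_X(x) \log_D \frac{p_X(x)}{D^{-l(x)}} \\
  &= \sum_{x} p_X(x) \log_D \frac{p_X(x)}{q(x)} + \log_D \frac{1}{K} \\
  &\ge 0 .
\end{align*}
```

The first term of the last sum is a relative entropy and hence nonnegative; the second is nonnegative because $K \le 1$. Unique decodability enters only through the bound $K \le 1$, so the whole chain rests on which definition of unique decodability is used in step (a).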
In the above flow of deductions there is an implicit assumption which is not obvious and, in a certain way, not clearly supported. It is implicitly assumed that the definition of uniquely decodable code used in McMillan's theorem is also appropriate for sources with memory. Of course, a definition being a definition, one can freely choose to define uniquely decodable code in any preferred way. However, as shown by the code of Table I in the previous section, the definition of uniquely decodable code used in McMillan's theorem does not coincide with the intuitive idea of decodable for certain sources with memory. To our knowledge, this ambiguity has never been reported previously in the literature, and for this reason it has been erroneously believed that the result $E[l(X_1, X_2, \ldots, X_n)] \geq H(X_1, X_2, \ldots, X_n)$ holds for every practically usable code. As shown by the Markov source example presented, this interpretation is incorrect.

In order to better understand the confusion associated with the meaning of uniquely decodable code, it is interesting to focus on a small difference between the formal definitions given by the authors in [5] and in [6]. We start by rephrasing for notational convenience the definition given by Cover and Thomas in [5].

Definition 2: [5, Sec. 5.1, pp. 79-80] A code is said to be uniquely decodable if no finite sequence of code symbols can be obtained in two or more different ways as a concatenation of codewords.

Note that this definition is the same used in McMillan's paper [3], and it considers a property of the codebook without any reference to sources. It is however difficult to find a clear motivation for such a source-independent definition. After all, a code is always designed for a given source, not for a given alphabet. Indeed, right after giving the formal definitions, the authors comment: "In other words, any encoded string in a uniquely decodable code has only one possible source string producing it."
So, a reference to sources is introduced. What is not noticed is that the condition given in the formal definition coincides with the phrased one only if the source at hand can produce any possible combination of symbols as output. Otherwise, the two definitions are not equivalent, the first one being stronger, the second one being instead more intuitive.
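The gap between the two readings can be seen by brute force on the alternative code of Table I. The following sketch is only an illustration: the `ALLOWED` transition table is an assumption inferred from the description of the example source, the search is limited to sequences of at most `MAX_LEN` symbols (so it can exhibit a collision but cannot by itself prove their absence), and all helper names are hypothetical.

```python
from itertools import product

CODE = {"A": "0", "B": "1", "C": "01", "D": "10"}

# Assumed allowed successors of each symbol for the example source
# (inferred from the text): A -> {A, C}, B -> {B, D}, C and D -> anything.
ALLOWED = {"A": "AC", "B": "BD", "C": "ABCD", "D": "ABCD"}

def encode(seq):
    """Concatenate the codewords of the symbols in seq."""
    return "".join(CODE[s] for s in seq)

def producible(seq):
    """True if every transition in seq is allowed by the source."""
    return all(b in ALLOWED[a] for a, b in zip(seq, seq[1:]))

def all_sequences(max_len):
    """All symbol sequences of length 1..max_len, constrained or not."""
    for n in range(1, max_len + 1):
        yield from product("ABCD", repeat=n)

def collisions(sequences):
    """Map each bit string to the sequences producing it; keep only clashes."""
    seen = {}
    for seq in sequences:
        seen.setdefault(encode(seq), []).append("".join(seq))
    return {bits: seqs for bits, seqs in seen.items() if len(seqs) > 1}

MAX_LEN = 6
# Codebook-only reading (Definition 2): every concatenation counts.
unconstrained = collisions(all_sequences(MAX_LEN))
# Source-aware reading: only sequences the source can actually emit.
constrained = collisions(s for s in all_sequences(MAX_LEN) if producible(s))

kraft_sum = sum(2.0 ** -len(w) for w in CODE.values())

print(sorted(unconstrained["01"]), len(constrained), kraft_sum)
```

Under the codebook-only reading the string 01 is ambiguous (AB or C), and the Kraft sum of the codeword lengths is 3/2, so the codebook cannot be uniquely decodable in that sense. Restricted to producible sequences, no collision is found up to the chosen length.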

With respect to formal definitions, Gallager proceeds in a different way with the following:

Definition 3: [6, Sec. 3.2, pg. 45] A code is uniquely decodable if for each source sequence of finite length, the sequence of code letters corresponding to that source sequence is different from the sequence of code letters corresponding to any other source sequence.

Note that this is a formal definition of unique decodability of a code with respect to a given source. Gallager states this definition while discussing memoryless sources¹. In that case, the definition is clearly equivalent to Definition 2 but, unfortunately, Gallager implicitly uses Definition 2 instead of Definition 3 when dealing with sources with memory².

In order to avoid the ambiguity discussed above, we propose to adopt the following explicit definition.

Definition 4: A code $C$ is said to be uniquely decodable for the source $X$ if no two different finite sequences of source symbols producible by $X$ have the same code.

With this definition, not all uniquely decodable codes for a given source satisfy Kraft's inequality. So, the chain of deductions (a)-(d) listed at the beginning of this section cannot be used for constrained sources, as McMillan's theorem uses Definition 2 of unique decodability. The alternative code of Table I thus immediately gives:

Lemma 1: There exists at least one source $X$ and a uniquely decodable code for $X$ such that, for every $n \geq 1$,

$$E[l(X_1, X_2, \ldots, X_n)] < H(X_1, X_2, \ldots, X_n).$$

B. Extension of McMillan's theorem to Markov sources

In Section II, the proposed alternative code demonstrates that McMillan's theorem does not apply in general to uniquely decodable codes for a constrained source $X$ as defined in Definition 4. In this section a modified version of Kraft's inequality is proposed which represents a necessary condition for the unique decodability of a code for a first order Markov source. Let $X$ be a Markov source with alphabet $\mathcal{X} = \{1, 2, \ldots, m\}$ and transition probability matrix $P$.
Let $W = \{w_1, w_2, \ldots, w_m\}$ be a set of $D$-ary codewords for the alphabet $\mathcal{X}$ and let $l_i = l(w_i)$ be the length of codeword $w_i$. McMillan's original theorem can be stated in the following way:

Theorem 1 (McMillan, [3]): If the set of codewords $W$ is uniquely decodable (in the sense of Definition 2) then

$$\sum_{i=1}^{m} D^{-l_i} \leq 1.$$

We propose a modified theorem that considers unique decodability for the specific source.

¹ See [6, pg. 45]: "We also assume, initially, [...] that successive letters are independent."

² In fact, in [6], the proof of Theorem 3.5.2, on page 58, is based on Theorem 3.3.1, on page 50, the proof of which states: "...follows from Kraft's inequality, [...] which is valid for any uniquely decodable code." But Kraft's inequality is valid for uniquely decodable codes defined as in Definition 2 and not Definition 3.

Theorem 2: If the set of codewords $W$ is uniquely decodable for the Markov source $X$, then the matrix $Q$ defined by

$$Q_{ij} = \begin{cases} 0 & \text{if } P_{ij} = 0 \\ D^{-l_j} & \text{if } P_{ij} > 0 \end{cases}$$

has spectral radius at most 1.

Proof: The proof is very similar to Karush's proof of McMillan's theorem [12]. Let $\mathcal{X}^{(k)}$ be the set of all sequences of $k$ symbols that can be produced by the source and let $L = [D^{-l_1}, D^{-l_2}, \ldots, D^{-l_m}]$. For $k > 0$, define the row vector

$$V^{(k)} = L Q^{k-1}.$$

It is easy to see by induction that the $i$-th component of $V^{(k)}$ can be written as

$$V^{(k)}_i = \sum_{h_1, h_2, \ldots, h_k} D^{-l_{h_1} - l_{h_2} - \cdots - l_{h_k}},$$

where the sum runs over all sequences of indices $(h_1, h_2, \ldots, h_k)$ with varying $h_1, h_2, \ldots, h_{k-1}$ and $h_k = i$ such that $(h_1, h_2, \ldots, h_k) \in \mathcal{X}^{(k)}$. So, calling $\mathbf{1}_m$ the column vector composed of $m$ ones, we have

$$L Q^{k-1} \mathbf{1}_m = \sum_{(h_1, h_2, \ldots, h_k) \in \mathcal{X}^{(k)}} D^{-l_{h_1} - l_{h_2} - \cdots - l_{h_k}}.$$

Reindexing the sum with respect to the total length $r = l_{h_1} + l_{h_2} + \cdots + l_{h_k}$ and calling $N(r)$ the number of sequences of $\mathcal{X}^{(k)}$ which are mapped into a code of length $r$, we have

$$L Q^{k-1} \mathbf{1}_m = \sum_{r=1}^{k\, l_{\max}} N(r) D^{-r},$$

where $l_{\max}$ is the maximum of the values $l_i$, $i = 1, 2, \ldots, m$.
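The identity just derived, and the condition of Theorem 2, can be checked numerically on the example of Section II. This is a sketch under stated assumptions rather than part of the proof: the zero pattern `SUPPORT` of the transition matrix is inferred from the description of the example source, the codeword lengths are those of the alternative code, and the spectral radius is estimated by plain power iteration, which is adequate here only because this particular $Q$ is nonnegative and primitive. All function names are hypothetical.

```python
from itertools import product

D = 2
LENGTHS = [1, 1, 2, 2]  # codeword lengths for A, B, C, D (alternative code)

# Assumed support of P (1 where P_ij > 0), rows and columns ordered A, B, C, D.
SUPPORT = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
m = len(LENGTHS)

# Q_ij = D^(-l_j) if P_ij > 0, and 0 otherwise; L = [D^(-l_1), ..., D^(-l_m)].
Q = [[SUPPORT[i][j] * D ** -LENGTHS[j] for j in range(m)] for i in range(m)]
L = [D ** -l for l in LENGTHS]

def lhs(k):
    """L Q^(k-1) 1_m, computed with repeated row-vector times matrix products."""
    v = L[:]
    for _ in range(k - 1):
        v = [sum(v[i] * Q[i][j] for i in range(m)) for j in range(m)]
    return sum(v)

def brute_force(k):
    """Sum of D^(-total code length) over all producible k-symbol sequences."""
    total = 0.0
    for seq in product(range(m), repeat=k):
        if all(SUPPORT[a][b] for a, b in zip(seq, seq[1:])):
            total += D ** -sum(LENGTHS[h] for h in seq)
    return total

def spectral_radius(matrix, iters=200):
    """Power-iteration estimate of the dominant eigenvalue of a primitive matrix."""
    v = [1.0] * m
    rho = 0.0
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        rho = max(w)
        v = [x / rho for x in w]
    return rho

for k in range(1, 7):
    print(k, lhs(k), brute_force(k), k * max(LENGTHS))
print("spectral radius of Q:", spectral_radius(Q))
```

On this example the estimated spectral radius is 1, so the condition of Theorem 2 is met with equality, while the plain Kraft sum of the same lengths is 3/2.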
Since the code is uniquely decodable for the source $X$, there are at most $D^r$ source-compatible sequences with a code of length $r$, that is, $N(r) \leq D^r$. Hence, for every $k > 0$,

$$L Q^{k-1} \mathbf{1}_m \leq \sum_{r=1}^{k\, l_{\max}} D^{r} D^{-r} = k\, l_{\max}. \qquad (2)$$

Now, note that the irreducible matrix $Q$ is also nonnegative. Thus, by the Perron-Frobenius theorem (see [13] for details), its spectral radius $\rho(Q)$ is also an eigenvalue, with algebraic multiplicity 1 and with positive associated left eigenvector. Suppose now $\rho(Q) > 1$. Since $L$ and $\mathbf{1}_m$ are both positive, it is easy to deduce that the term on the left hand side of inequality (2) asymptotically grows as $\rho(Q)^{k-1}$ when $k$ goes to infinity. On the contrary, the right hand side term only grows linearly with $k$ and, for large enough $k$, inequality (2) cannot hold. We conclude that $\rho(Q) \leq 1$.

IV. SOME ADDITIONAL REMARKS

Remark 1 (Theorem 2 generalizes Theorem 1): In the case of unconstrained Markov sources, Theorem 2 is equivalent to Theorem 1. Indeed, the Markov source being not constrained means that its transition probability matrix $P$ has all strictly positive entries. This implies that the matrix $Q$ defined in Theorem 2 has all equal rows. The spectral radius of such a matrix equals the sum of the elements in every row, which is $\sum_j D^{-l_j}$, reducing thus to the classic Kraft's inequality.

Remark 2 (Non sufficiency of the condition): Kraft's inequality is both a necessary and sufficient condition for the existence of a uniquely decodable code (in the sense of Definition 2) with codeword lengths $l_i$. Theorem 2, instead, only gives a necessary condition on the lengths $l_i$ for the unique decodability of a code for a given source. It is easy to show that the condition stated in the theorem is not a sufficient condition for the existence of a uniquely decodable code for a source with codeword lengths $l_i$. Finding a necessary and sufficient condition seems to be a much harder problem.

Remark 3 (Extended Sardinas-Patterson test): With respect to the previous remark, we point out that it is however possible to test a given code for decodability for a given source by devising a generalization of the Sardinas-Patterson test [14] to deal with constrained sources (see [15]).

Remark 4 (A more general form of Theorem 2): Theorem 2 was formulated for the case of Markov chains in the Moore form, as considered for example in [5]. In other words, we have modeled information sources as Markov chains by assigning an output source symbol to every state. In order to deal with more general sources we can consider Markov sources in the Mealy form, where output symbols are not associated to states but to transitions between states (which corresponds to the Markov source model used by Shannon in [2] or, for example, by Gallager in [6]). Theorem 2 can be extended to this type of Markov sources as follows (see [15]).

Theorem 3: Let $X$ be a finite state source, with possible states $S_1, S_2, \ldots, S_q$ and with output symbols in the alphabet $\mathcal{X} = \{1, 2, \ldots, m\}$. Let $W = \{w_1, \ldots, w_m\}$ be a set of codewords for the symbols in $\mathcal{X}$ with lengths $l_1, l_2, \ldots, l_m$.
Let $O_{i,j}$ be the subsets of $\mathcal{X}$ of possible symbols output by the source when transiting from state $S_i$ to state $S_j$, $O_{i,j}$ being the empty set if the transition from $S_i$ to $S_j$ is impossible. If the code is uniquely decodable for the source $X$, then the matrix $Q$ defined by

$$Q_{ij} = \sum_{h \in O_{i,j}} D^{-l_h}$$

has spectral radius at most 1.

Remark 5 (Shannon's insight): A historical analysis reveals that both McMillan's theorem and the proposed generalized one in the form of Theorem 3 are mathematically equivalent to a formulation obtained by Shannon already in [2, Part I, Sec. 1] for the evaluation of the capacity of discrete noiseless channels. In particular, in [2] Shannon established that the capacity of an unconstrained noiseless channel with symbol durations $t_1, t_2, \ldots, t_m$ is given by the value $\log X_0$, where $X_0$ is the largest real solution of the equation

$$X^{-t_1} + X^{-t_2} + \cdots + X^{-t_m} = 1.$$

It is not difficult to show that McMillan's theorem is equivalent to the obvious statement that the capacity of a $D$-ary channel is at most $\log D$. Furthermore, Shannon generalized the capacity formula to the case of noiseless finite state channels, by stating the following [2, Th. 1]:

Theorem 4 (Shannon, [2]): Let $b^{(s)}_{ij}$ be the duration of the $s$-th symbol which is allowable in state $i$ and leads to state $j$. Then the channel capacity $C$ is equal to $\log W_0$ where $W_0$ is the largest real root of the determinant equation

$$\left| \sum_{s} W^{-b^{(s)}_{ij}} - \delta_{ij} \right| = 0.$$

As for the unconstrained case, it is possible to show that Theorem 3 is equivalent to the statement that every finite state $D$-ary channel has capacity at most $\log D$.

REFERENCES

[1] S. Verdú, "Fifty years of Shannon theory," IEEE Trans. on Inform. Theory, vol. 44, no. 6, pp. 2057-2078, 1998.
[2] C. E. Shannon, "A mathematical theory of communication," Bell Sys. Tech. Journal, vol. 27, pp. 379-423, 623-656, 1948.
[3] B. McMillan, "Two inequalities implied by unique decipherability," IEEE Trans. Inform. Theory, vol. IT-2, pp. 115-116, 1956.
[4] L. G. Kraft, "A device for quantizing, grouping and coding amplitude modulated pulses," M.S. thesis, Dept. of Electrical Engineering, MIT, Cambridge, Mass., 1949.
[5] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley, New York, 1990.
[6] R. G. Gallager, Information Theory and Reliable Communication, Wiley, New York, 1968.
[7] A. D. Wyner, J. Ziv, and A. J. Wyner, "On the role of pattern matching in information theory," IEEE Trans. on Inform. Theory, vol. 44, no. 6, pp. 2045-2056, 1998.
[8] N. Alon and A. Orlitsky, "A lower bound on the expected length of one-to-one codes," IEEE Trans. on Inform. Theory, vol. 40, pp. 1670-1772, 1994.
[9] A. D. Wyner, "An upper bound on the entropy series," IEEE Trans. on Inform. Theory, vol. 20, pp. 176-181, 1972.
[10] C. Blundo and R. De Prisco, "New bounds on the expected length of one-to-one codes," IEEE Trans. on Inform. Theory, vol. 42, no. 1, pp. 246-250, 1996.
[11] S. A. Savari and A. Naheta, "Bounds on the expected cost of one-to-one codes," in Proc. IEEE Intern. Symp. on Inform. Theory, 2004, p. 92.
[12] J. Karush, "A simple proof of an inequality of McMillan," IRE Trans. Inform. Theory, vol. IT-7, pp. 118, 1961.
[13] H. Minc, Nonnegative Matrices, Wiley, 1988.
[14] A. A. Sardinas and G. W. Patterson, "A necessary and sufficient condition for the unique decomposition of coded messages," in IRE Convention Record, Part 8, 1953, pp. 104-108.
[15] M. Dalai and R. Leonardi, "On unique decodability, McMillan's theorem and the expected length of codes," University of Brescia, Technical Report R.T. 2008-01-58, 2008.