
ERGEBNISSE DER MATHEMATIK UND IHRER GRENZGEBIETE
UNTER MITWIRKUNG DER SCHRIFTLEITUNG DES "ZENTRALBLATT FÜR MATHEMATIK"
HERAUSGEGEBEN VON P. R. HALMOS · R. REMMERT · B. SZÖKEFALVI-NAGY
UNTER MITWIRKUNG VON L. V. AHLFORS · R. BAER · F. L. BAUER · R. COURANT · A. DOLD · J. L. DOOB · S. EILENBERG · M. KNESER · H. RADEMACHER · B. SEGRE · E. SPERNER
REDAKTION P. R. HALMOS
NEUE FOLGE · BAND 31
REIHE: WAHRSCHEINLICHKEITSTHEORIE UND MATHEMATISCHE STATISTIK
BESORGT VON J. L. DOOB
SPRINGER-VERLAG · BERLIN · GÖTTINGEN · HEIDELBERG · NEW YORK · 1964

CODING THEOREMS OF INFORMATION THEORY
BY J. WOLFOWITZ
PROFESSOR OF MATHEMATICS, CORNELL UNIVERSITY
SECOND EDITION
SPRINGER-VERLAG · BERLIN · GÖTTINGEN · HEIDELBERG · NEW YORK · 1964

All rights reserved, in particular the right of translation into foreign languages. Without the express permission of the publisher it is also not permitted to reproduce this book or parts of it by photomechanical means (photocopy, microcopy) or in any other way.
ISBN · ISBN (ebook) · DOI
© by Springer-Verlag, Berlin · Göttingen · Heidelberg 1961 and 1964
Library of Congress Catalog Card Number
Softcover reprint of the hardcover 2nd edition 1964
Titel Nr. 4575

DEDICATED TO THE MEMORY OF ABRAHAM WALD

Preface to the Second Edition

The imminent exhaustion of the first printing of this monograph and the kind willingness of the publishers have presented me with the opportunity to correct a few minor misprints and to make a number of additions to the first edition. Some of these additions are in the form of remarks scattered throughout the monograph. The principal additions are Chapter 11, most of Section 6.6 (including Theorem 6.6.2), and Sections 6.7, 7.7, and 4.9. It has been impossible to include all the novel and interesting results which have appeared in the last three years. I hope to include these in a new edition or a new monograph, to be written in a few years when the main new currents of research are more clearly visible.

There are now several instances where, in the first edition, only a weak converse was proved, and, in the present edition, the proof of a strong converse is given. Where the proof of the weaker theorem employs a method of general application and interest it has been retained and is given along with the proof of the stronger result. This is wholly in accord with the purpose of the present monograph, which is not only to prove the principal coding theorems but also, while doing so, to acquaint the reader with the most fruitful and interesting ideas and methods used in the theory.

I am indebted to Dr. SAMUEL KOTZ for devoted and valuable help in checking my revisions and for constructing the index, and to Professor H. P. BEARD for intelligent and conscientious reading of proofs. As earlier, I am grateful to the Air Force Office of Scientific Research and the Office of Naval Research for continued support; it is a pleasure to add the name of R. J. LUNDEGARD to the earlier list of their staff members to whom my thanks are due.

Cornell University, July, 1964
J. WOLFOWITZ

Preface to the First Edition

This monograph originated with a course of lectures on information theory which I gave at Cornell University during the academic year. It has no pretensions to exhaustiveness, and, indeed, no pretensions at all. Its purpose is to provide, for mathematicians of some maturity, an easy introduction to the ideas and principal known theorems of a certain body of coding theory. This purpose will be amply achieved if the reader is enabled, through his reading, to read the (sometimes obscurely written) literature and to obtain results of his own. The theory is obviously in a rapid stage of development; even while this monograph was in manuscript several of its readers obtained important new results.

The first chapter is introductory and the subject matter of the monograph is described at the end of the chapter. There does not seem to be a uniquely determined logical order in which the material should be arranged. In determining the final arrangement I tried to obtain an order which makes reading easy and yet is not illogical. I can only hope that the resultant compromises do not earn me the criticism that I failed on both counts.

There are a very few instances in the monograph where a stated theorem is proved by a method which is based on a result proved only later. This happens where the theorem fits in one place for completeness and the method of proof is based on a theorem which fits in elsewhere. In such cases the proof is given with the theorem, and the reader, duly warned, can come back to the proof later. This procedure, which occurs very rarely, will surely cause the reader no trouble and can be blamed on the compromises described above. This monograph certainly contains many errors. However, I hope that the errors still left are not so great as to hinder easy reading and comprehension.

My gratitude is due to several friends and colleagues: HELEN P. BEARD, L. J. COTE, J. H. B. KEMPERMAN, and J. KIEFER. They read the manuscript or parts of it and made suggestions which were always valued though not always accepted. The Air Force Office of Scientific Research and the Office of Naval Research subsidized my work at various times and I am grateful to them; M. M. ANDREW, DOROTHY M. GILFORD, R. G. POHRER, and O. A. SHAW of their staffs have always been cooperative and helpful. The invitation to publish in the Ergebnisse series came from J. L. DOOB, to whom my thanks are due.

Cornell University, September
J. WOLFOWITZ

Contents

1. Heuristic Introduction to the Discrete Memoryless Channel

2. Combinatorial Preliminaries
   Generated sequences
   Properties of the entropy function
   Remarks

3. The Discrete Memoryless Channel
   Description of the channel
   A coding theorem
   The strong converse
   Strong converse for the binary symmetric channel
   The finite-state channel with state calculable by both sender and receiver
   The finite-state channel with state calculable only by the sender
   Remarks

4. Compound Channels
   Introduction
   The canonical channel
   A coding theorem
   Strong converse
   Compound d.m.c. with c.p.f. known only to the receiver or only to the sender
   Channels where the c.p.f. for each letter is stochastically determined
   Proof of Theorem
   The d.m.c. with feedback
   Strong converse for the d.m.c. with feedback
   Remarks

5. The Discrete Finite-Memory Channel
   The discrete channel
   The discrete finite-memory channel
   The coding theorem for the d.f.m.c.
   Strong converse of the coding theorem for the d.f.m.c.
   Rapidity of approach to C in the d.f.m.c.
   Discussion of the d.f.m.c.
   Remarks

6. Discrete Channels with a Past History
   Preliminary discussion
   Channels with a past history
   Applicability of the coding theorems of Section 7.2 to channels with a past history
   A channel with infinite duration of memory of previously transmitted letters
   A channel with infinite duration of memory of previously received letters
   Indecomposable channels
   The power of the memory
   Remarks

7. General Discrete Channels
   Alternative description of the general discrete channel
   The method of maximal codes
   The method of random codes
   Weak converses
   Digression on the d.m.c.
   Discussion of the foregoing
   Channels without a capacity
   Remarks

8. The Semi-Continuous Memoryless Channel
   Introduction
   Strong converse of the coding theorem for the s.c.m.c.
   Proof of Lemma
   The strong converse with O(√n) in the exponent
   Remarks

9. Continuous Channels with Additive Gaussian Noise
   A continuous memoryless channel with additive Gaussian noise
   Message sequences within a suitable sphere
   Message sequences on the periphery of the sphere or within a shell adjacent to the boundary
   Another proof of Theorems
   Remarks

10. Mathematical Miscellanea
   Introduction
   The asymptotic equipartition property
   Admissibility of an ergodic input for a discrete finite-memory channel

11. Group Codes. Sequential Decoding
   Group codes
   Canonical form of the matrix M
   Sliding parity check codes
   Sequential decoding
   Remarks

References
Index
List of Channels Studied or Mentioned

1. Heuristic Introduction to the Discrete Memoryless Channel

The spirit of the problems discussed in the present monograph can already be gleaned from a consideration of the discrete memoryless channel, to a heuristic discussion of which the present chapter is devoted. In this discussion there will occur terms not yet precisely defined, to which the reader should give their colloquial meaning. This procedure is compatible with the purpose of the present chapter, which is to motivate the problems to be discussed later in the book, and not to carry forward the theory logically. The reader scornful of such unmathematical behavior or in no need of motivation may proceed at once to Chapter 2 without any loss of logical continuity. Such definitions as are given in the present chapter will be repeated later.

We suppose that a stream of symbols, each of which is a letter of an alphabet, is being sent over a discrete, noisy, memoryless channel. "Discrete" (a term of engineering origin) means here that the alphabet of the letters being sent, and the alphabet of the letters being received (which need not be the same), each have a finite number (> 1) of symbols. We take these two numbers to be the same, say a. Since the actual symbols will not matter we may assume, and do, that both alphabets consist of the numbers (letters) 1, ..., a. (It is easy to admit the possibility that one alphabet may have fewer symbols than the other. In that case a is the larger number of symbols, and the modifications needed in the theory below will be trivially obvious.) "Noisy" means that the letter sent may be garbled in transmission by noise (= chance error). Let the probability that the letter (number) j will be received when the letter i is sent be w(j | i). Of course

w(1 | i) + ... + w(a | i) = 1,   i = 1, ..., a.

"Memoryless" means that the noise affects each letter independently of all other letters. We shall assume that all the "words" we send have n letters. There is no reason for this assumption except that the theory has been developed under it, and that its elimination would make things more difficult. In ordinary writing, where the words of a language have different length, the word is determined by the spaces which separate it from its neighbors.
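The channel probability function just described can be pictured as a stochastic matrix whose i-th row is w(· | i). The following short Python sketch is my own illustration, not part of the monograph; the channel values are hypothetical. It simulates sending a few letters through such a channel, one independent draw per letter, which is exactly what "memoryless" means here.

```python
# Illustrative sketch (not from the monograph): a discrete memoryless channel
# as a stochastic matrix w[i-1][j-1] = w(j | i), rows summing to one, and the
# transmission of single letters through it.
import random

# Hypothetical binary channel with alphabet {1, 2}.
w = [[0.9, 0.1],
     [0.2, 0.8]]

def send_letter(i, w, rng=random):
    """Return the received letter when letter i (1-based) is sent."""
    u, acc = rng.random(), 0.0
    for j, p in enumerate(w[i - 1], start=1):
        acc += p
        if u < acc:
            return j
    return len(w[i - 1])          # guard against rounding

# "Memoryless": each letter of an n-sequence is disturbed independently.
sent = [1, 2, 2, 1, 1]
received = [send_letter(i, w) for i in sent]
print(sent, received)
```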

Such a "space" is really a letter of the alphabet. In a theory such as the one to be studied below, one seeks the most efficient use, in a certain sense, of the letters of the alphabet to transmit the words of a language. It is not at all certain that setting aside one letter of the alphabet to serve as a "space" is really the most efficient use of that letter (this even ignoring the fact that, because of noise, a "space" may not be received as a "space"). If one letter is not to serve to separate one word from the next then this will have to be done by the code (definition later), a requirement which would enormously complicate the problem. Thus we assume that all the words sent are sequences of n integers, each integer one of 1, ..., a. (Any such sequence will be called an n-sequence.) Asymptotically (with n) this requirement will make no difference.

Suppose we have a dictionary of N words in a certain language. Here language could, for example, mean what is colloquially meant by a language, or it could mean the totality of words in a certain military vocabulary, or in a given diplomatic code book, or in a prescribed technical handbook. The words in the dictionary are of course what are colloquially known as words; they are written in the alphabet of the language (if the latter is English, for example, its alphabet is not the alphabet 1, ..., a of the letters sent over the channel) and they are not in general n-sequences or even of the same length (certainly not if the language is English). We wish to transmit (synonym for "send") any of these words in any arbitrary order and with any arbitrary frequency over the channel. If there were no noise we would proceed as follows: Let n be the smallest integer such that a^n ≥ N. Construct in any manner a one-to-one correspondence between the words of the dictionary and N of the a^n n-sequences, this correspondence of course known to both sender and receiver. When one wishes to send any word in the dictionary one sends the n-sequence which corresponds to it. The receiver always receives the n-sequence correctly, and hence knows without error the word being sent.

When there is noise we must expect that in general there will be a positive probability that an n-sequence sent will be incorrectly received, i.e., received as another sequence (error of transmission). For example, when w(j | i) > 0 for every i, j = 1, ..., a, this will surely be the case. We would like then to be able to send over the channel any of the N words of the dictionary (more properly, any of the N n-sequences which correspond to the N words of the dictionary), with a probability ≤ λ, say, that any word of the dictionary (or rather the n-sequence corresponding to it) sent will be incorrectly understood by the receiver. (Of course it is in general impossible that λ = 0. We shall be satisfied if λ > 0 is suitably small, and shall henceforth always assume, unless the contrary is explicitly stated, that λ > 0.) To achieve this we must proceed approximately as follows: Choose an n sufficiently large, for which a^n ≥ N. There are of course a^n n-sequences. Establish a one-to-one correspondence between the N words of the dictionary and a properly selected set of N of the a^n sequences. The latter will be chosen so that the "distance" between any two of the selected sequences (in some sense yet to be established) is "sufficiently large" (the sense of this must also be established); it is to permit this that n must be sufficiently large. When one wishes to send any word of the dictionary one sends the n-sequence corresponding to it. The receiver operates according to the following rule: He always infers that the n-sequence actually sent is that one, of the N n-sequences which correspond to the N words of the dictionary, which is "nearest" to the n-sequence received ("resembles" it most). If the probability is ≥ 1 − λ that, when any of the N n-sequences which correspond to words of the dictionary is sent, the n-sequence actually received is nearer to the one sent than to any of the other N − 1 n-sequences, the problem would be solved.

To summarize then, the problem is to choose N n-sequences in such a way that, whichever of the N n-sequences is sent, the receiver can correctly infer which one has been sent with probability ≥ 1 − λ. The solution is, roughly, to choose n sufficiently large so that the N n-sequences can be embedded in the space of all a^n n-sequences with enough distance between any two that, when any one of the N n-sequences is sent, the set of all n-sequences nearer to it than to any of the other N − 1 n-sequences has probability ≥ 1 − λ of being received. We would like to emphasize that we have used the words "distance" and "nearest" in a very vague sense, and that they should not be taken literally. More generally, the problem is to construct a "code", as follows:

a) to choose N among the a^n possible transmitted n-sequences as the sequences u_1, ..., u_N to correspond to the words of the dictionary;

b) to divide up the a^n possible received n-sequences into N disjoint sets A_1, ..., A_N such that, when u_i is sent, the probability that the received sequence will lie in A_i is ≥ 1 − λ, i = 1, ..., N. Then, when the receiver receives a sequence in A_i he decides that u_i was sent.

We shall now show that a solution can actually be effected with very little trouble. To simplify the discussion we shall assume that a = 2 (the general result will be obvious anyhow) and that w(1 | 1) ≠ w(1 | 2) (hence w(2 | 1) ≠ w(2 | 2)); the explanation for this assumption will be given later in order not to cause a digression here. Let k be any integer, say the smallest, such that 2^k ≥ N. Set up a one-to-one correspondence between the N words of the dictionary and any N k-sequences in any manner; henceforth we identify the words with these k-sequences.
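To make the notion in a) and b) above concrete, here is a small Python sketch of my own; the channel, the two codewords, and the "nearest codeword" decoder are hypothetical choices, not the monograph's. The decoder implicitly defines the disjoint sets A_1, ..., A_N, and a Monte Carlo run estimates the largest error probability λ that this particular code achieves. The codewords are simple repetitions, anticipating the construction discussed next.

```python
# Sketch of the code notion: N codewords u_1, ..., u_N and a decoder that
# partitions the received n-sequences into sets A_1, ..., A_N.  The code has
# error probability <= lambda if each word is decoded correctly with
# probability >= 1 - lambda.
import random

w = [[0.9, 0.1], [0.2, 0.8]]                    # hypothetical c.p.f., a = 2

def transmit(u, rng):
    """Send the n-sequence u through the memoryless channel."""
    return tuple(1 if rng.random() < w[x - 1][0] else 2 for x in u)

def estimate_max_error(codewords, decode, trials=20000, seed=0):
    """Monte Carlo estimate of max_i P(decode(received) != i | u_i sent)."""
    rng = random.Random(seed)
    worst = 0.0
    for i, u in enumerate(codewords):
        errors = sum(decode(transmit(u, rng)) != i for _ in range(trials))
        worst = max(worst, errors / trials)
    return worst

# Two codewords of length n = 5 and a toy "nearest codeword" decoder.
codewords = [(1, 1, 1, 1, 1), (2, 2, 2, 2, 2)]
def decode(v):
    return min(range(len(codewords)),
               key=lambda i: sum(a != b for a, b in zip(codewords[i], v)))

print("estimated lambda:", estimate_max_error(codewords, decode))
```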

In order to transmit any k-sequence so that the probability of its correct reception is ≥ 1 − λ, we simply send the same sequence m times consecutively, where m is a suitable integer which depends upon k and λ (and of course w(· | ·)) and whose existence will shortly be demonstrated. Thus n = km. The idea of the method by which the receiver decides upon which sequence has been sent may be described simply as follows: The receiver knows that a letter b (unknown to him) was sent (not consecutively) m times and received as b_1, ..., b_m, say. He decides that

b = 1 if Π_{i=1}^{m} w(b_i | 1) > Π_{i=1}^{m} w(b_i | 2),
b = 2 if Π_{i=1}^{m} w(b_i | 1) < Π_{i=1}^{m} w(b_i | 2).

(In statistics this is called the maximum likelihood method.) If the above products are equal he may make any decision he pleases. We shall now prove that one can choose m so that, if the letter c is sent (c = 1, 2), then, calling the other letter c', the probability of the set of all points b = (b_1, ..., b_m) such that

Π_{i=1}^{m} w(b_i | c) > Π_{i=1}^{m} w(b_i | c')

is greater than (1 − λ)^{1/k}. Thus the probability that all k letters are correctly received (inferred) is > 1 − λ. (It would actually be better to apply the maximum likelihood method to the entire k-sequence than to each letter separately, but in an introduction such as this there is no need to dwell on this point.)

First we note that, for c, c' = 1, 2, and c ≠ c',

h(c) = Σ_{i=1}^{2} w(i | c) log w(i | c) − Σ_{i=1}^{2} w(i | c) log w(i | c') > 0.

To prove this maximize

Σ_{i=1}^{2} w(i | c) log π_i

with respect to π_1 and π_2, subject to the restriction π_1 + π_2 = 1. The usual Lagrange multiplier argument gives that a unique maximum exists at π_i = w(i | c), i = 1, 2. Since, by our assumption, w(i | c) ≠ w(i | c'), i = 1, 2, the desired result follows. Now let h = min(h(1), h(2)) > 0.
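As a concrete illustration of this per-letter maximum likelihood rule, here is a sketch of my own, using the same hypothetical channel values as before. The receiver simply compares the two likelihoods of the received m-tuple; sums of logarithms are used instead of the products only to avoid underflow, and the comparison is the same.

```python
# My own sketch of the per-letter maximum likelihood decision: the same letter
# was sent m times; compare the likelihood of the received m-tuple under b = 1
# and under b = 2.
import math

w = {(1, 1): 0.9, (2, 1): 0.1, (1, 2): 0.2, (2, 2): 0.8}   # w[(j, i)] = w(j | i)

def ml_decide(received):
    """Decide which letter (1 or 2) was repeated, given the m received letters."""
    loglik1 = sum(math.log(w[(b, 1)]) for b in received)
    loglik2 = sum(math.log(w[(b, 2)]) for b in received)
    if loglik1 > loglik2:
        return 1
    if loglik2 > loglik1:
        return 2
    return 1          # products equal: any decision is allowed

print(ml_decide([1, 1, 2, 1, 1]))   # -> 1
print(ml_decide([2, 2, 2, 1, 2]))   # -> 2
```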

Let Z_1, ..., Z_m be independent chance variables with the common distribution P{Z_1 = i} = w(i | c), i = 1, 2, where c is 1 or 2. Let c' ≠ c be 1 or 2. From the preceding paragraph we have¹

E[log w(Z_1 | c) − log w(Z_1 | c')] ≥ h > 0.

Hence, for m ≥ m(c) sufficiently large we have, by the law of large numbers,

P{ Σ_{i=1}^{m} [log w(Z_i | c) − log w(Z_i | c')] > 0 } > (1 − λ)^{1/k}.

Now let m = max(m(1), m(2)), and the desired result is evident.

We now digress for a moment to discuss the reason for the assumption that w(1 | 1) ≠ w(1 | 2). If w(1 | 1) = w(1 | 2), and hence w(2 | 1) = w(2 | 2), the distribution of the received letter is the same no matter which letter is sent. Thus it is impossible to infer from the letter received what letter has been sent; for any k there is essentially only one k-sequence which can be sent. Such a channel is clearly trivial and of no interest.

Returning to the problem under consideration, we have seen that, for a = 2, and this could also be shown generally, a trivial solution exists. To make the problem interesting and more realistic we ask how small an n will suffice to enable us to transmit N words with probability of error ≤ λ. This problem and two equivalent alternative versions may be listed as follows for future reference:

Form I). Given N and λ, how small an n will suffice?
Form II). Given n and λ, how big an N can be achieved?
Form III). Given n and N, how small a λ can be achieved?

A companion problem to the above problem (call it the first) is the (second) problem of actually constructing a code which would "implement" the answer to the first problem. In fact, it might a priori be reasonably thought that the first problem could not be solved without the second. This is not the case, and at present writing our knowledge about the first problem considerably exceeds our knowledge about the second problem. What success has been attained in the study of the first problem is due to certain brilliant fundamental work of SHANNON, and subsequent work of the latter and other writers. This monograph is devoted to a discussion of some of the most outstanding results on the first problem for large n. We will not be restricted to the discrete channel without memory, and will consider other channels as well.

¹ E{ } denotes the expected value of the chance variable in brackets. P{ } denotes the probability of the set in brackets.
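The two facts just used, that h(c) is strictly positive and that the law of large numbers then makes the correct letter win for large m, can be checked numerically. The following sketch is my own, with the same hypothetical channel as before and base-2 logarithms (any base would do here); the printed probabilities approach 1 as m grows, which is all the argument above requires.

```python
# My own sketch: h(c) is positive, and the per-repetition log-likelihood gap
# has mean h(c) > 0, so for large m the maximum likelihood rule is correct
# with probability close to one.
import math, random

w = [[0.9, 0.1], [0.2, 0.8]]                      # w[c-1][j-1] = w(j | c)

def h(c):
    cp = 2 if c == 1 else 1
    return sum(w[c - 1][j] * (math.log2(w[c - 1][j]) - math.log2(w[cp - 1][j]))
               for j in range(2))

def p_correct(c, m, trials=20000, seed=1):
    cp = 2 if c == 1 else 1
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        s = 0.0
        for _ in range(m):
            j = 0 if rng.random() < w[c - 1][0] else 1
            s += math.log2(w[c - 1][j]) - math.log2(w[cp - 1][j])
        wins += s > 0
    return wins / trials

print("h(1) =", round(h(1), 3), " h(2) =", round(h(2), 3))   # both positive
for m in (1, 5, 15):
    print("m =", m, " P(correct | c = 1) ~", p_correct(1, m))
```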

2. Combinatorial Preliminaries

2.1. Generated sequences. In this section we shall obtain certain combinatorial properties of sequences of n integers, each integer one of 1, ..., a. The motivation for our interest in these properties will become apparent in Chapter 3. By at once proving the necessary facts we gain in efficiency at the expense of a temporary lack of motivation. To avoid the trivial we assume, throughout this monograph, that a ≥ 2 and n ≥ 2; the main interest in application will usually be for large n.

The use of combinatorial arguments is frequent in probability theory. We shall reverse this usual procedure and use formally probabilistic arguments to obtain combinatorial results. However, the probabilistic arguments to be employed are very simple and of no depth, and could easily be replaced by suitable combinatorial arguments. Their chief role is therefore one of convenience, to enable us to proceed with speed and dispatch. The form of these arguments will always be this: It is required to give an upper or lower bound on the number of sequences in a certain set. A suitable probability distribution gives a lower bound α (upper bound β) to the probability of each of the sequences in the set, and a lower bound α_1 (upper bound β_1) to the probability of the set itself. Then α_1/β is a lower bound on the number of sequences in the set, and β_1/α is an upper bound on the number of sequences in the set. Of course, the proper choice of distribution is essential to obtaining the best values for these bounds.

A sequence of n integers, each one of 1, ..., a, will be called an n-sequence. Let u_0 be any n-sequence. We define N(i | u_0), i = 1, ..., a, as the number of elements i in u_0. Let u_0 and v_0 be two n-sequences. We define N(i, j | u_0, v_0), i, j = 1, ..., a, as the number of k, k = 1, ..., n, such that the k-th element of u_0 is i and the k-th element of v_0 is j.

Let π = (π_1, ..., π_a) be a vector with a non-negative components which add to one. The symbol π will always be reserved for such a vector, which will be called a π-vector, a (probability) distribution, or a probability vector. When we wish to specify the number of components we shall speak of a π a-vector or a probability a-vector or a distribution on a points. An n-sequence u_0 will be called a π-sequence or a πn-sequence if

|N(i | u_0) − nπ_i| ≤ 2√(a n π_i (1 − π_i)),   i = 1, ..., a.   (2.1.1)
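Condition (2.1.1) is easy to test mechanically. The following Python sketch is my own illustration; it counts the letters of a candidate sequence and checks each count against the tolerance in (2.1.1).

```python
# My own check of (2.1.1): is u_0 a pi-sequence, i.e. is every letter count
# N(i | u_0) within 2*sqrt(a*n*pi_i*(1-pi_i)) of its expected value n*pi_i?
import math
from collections import Counter

def is_pi_sequence(u0, pi):
    n, a = len(u0), len(pi)
    counts = Counter(u0)                       # N(i | u_0)
    return all(
        abs(counts.get(i + 1, 0) - n * pi[i])
        <= 2 * math.sqrt(a * n * pi[i] * (1 - pi[i]))
        for i in range(a)
    )

pi = (0.5, 0.5)
print(is_pi_sequence([1, 2, 1, 2, 1, 1, 2, 2], pi))    # balanced: True
print(is_pi_sequence([1] * 100, pi))                   # far off: False
```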

Let w(j | i), i, j = 1, ..., a, be any function of (i, j) such that

w(· | i) = (w(1 | i), ..., w(a | i)),   i = 1, ..., a,   (2.1.2)

is a probability vector. The significance of w(· | ·) will become apparent in Chapter 3; it will be called a "channel probability function" (c.p.f.). Let δ > 2a be a number to be determined later. An n-sequence v_0 will be said to be generated by an n-sequence u_0 if

|N(i, j | u_0, v_0) − N(i | u_0) w(j | i)| ≤ δ [N(i | u_0) w(j | i) (1 − w(j | i))]^{1/2}   (2.1.3)

for all i, j = 1, ..., a.

Due to a convention of no importance but hallowed by tradition (of more than fifteen years!), all the logarithms in this monograph will be to the base 2. In order to avoid circumlocutions we shall adopt the convention that a quantity which, when taken literally, appears to be 0 log 0, is always equal to zero. We define the function H of a π-vector π as follows:

H(π) = − Σ_{i=1}^{a} π_i log π_i.   (2.1.4)

H(π) is called the "entropy" of π. Its combinatorial importance, as we shall shortly see, is that the principal parts (for large n) of the logarithms of the numbers of n-sequences with certain properties are the entropies of certain distributions. In fact, we intend now to estimate the number B_1(w | u_0) of n-sequences generated by u_0, where u_0 is any π-sequence, and the number B_2(w | π) of different n-sequences generated by all π-sequences. The function defined by (2.1.4) will often occur below, and should therefore, for convenience and brevity, have a name. However, we shall draw no implicit conclusions from its name, and shall use only such properties of H as we shall explicitly prove. In particular, we shall not erect any philosophical systems on H as a foundation. One reason for this is that we shall not erect any philosophical systems at all, and shall confine ourselves to the proof of mathematical theorems.

Let

(X_1, Y_1), ..., (X_n, Y_n)   (2.1.5)

be independent 2-vector chance variables with a common distribution determined by

P{X_1 = i} = π_i,   i = 1, ..., a,   (2.1.6)

P{Y_1 = j | X_1 = i} = w(j | i),   i, j = 1, ..., a.   (2.1.7)

Hence

P{Y_1 = j} = Σ_i π_i w(j | i) = π'_j (say),   j = 1, ..., a.   (2.1.8)
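A small sketch of my own combines the entropy (2.1.4), with base-2 logarithms and the 0 log 0 = 0 convention, with a mechanical check of the "generated by" condition (2.1.3) for a hypothetical c.p.f. and a hypothetical δ.

```python
# My own sketch: the entropy H of (2.1.4) and the check of condition (2.1.3).
import math
from collections import Counter

def H(pi):
    return -sum(p * math.log2(p) for p in pi if p > 0)    # 0 log 0 := 0

def is_generated(v0, u0, w, delta):
    """Check (2.1.3): is v_0 generated by u_0 for the c.p.f. w[(j, i)] = w(j | i)?"""
    a = max(max(u0), max(v0))
    n_i = Counter(u0)                                       # N(i | u_0)
    n_ij = Counter(zip(u0, v0))                             # N(i, j | u_0, v_0)
    return all(
        abs(n_ij.get((i, j), 0) - n_i.get(i, 0) * w[(j, i)])
        <= delta * math.sqrt(n_i.get(i, 0) * w[(j, i)] * (1 - w[(j, i)]))
        for i in range(1, a + 1) for j in range(1, a + 1)
    )

w = {(1, 1): 0.9, (2, 1): 0.1, (1, 2): 0.2, (2, 2): 0.8}
print(H((0.5, 0.5)), H((0.9, 0.1)))        # 1.0 and about 0.469
u0 = [1, 1, 1, 1, 2, 2, 2, 2]
print(is_generated([1, 1, 1, 2, 2, 2, 2, 2], u0, w, delta=5))   # delta > 2a = 4
```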

The vector

π' = (π'_1, ..., π'_a)   (2.1.9)

is of course a probability vector. Also

P{X_1 = j | Y_1 = i} = π_j w(i | j) / Σ_k π_k w(i | k) = w'(j | i) (say),   i, j = 1, ..., a.   (2.1.10)

Define the probability vectors

w'(· | i) = (w'(1 | i), ..., w'(a | i)),   i = 1, ..., a.   (2.1.11)

Let

X = (X_1, ..., X_n)   (2.1.12)

and

Y = (Y_1, ..., Y_n).   (2.1.13)

Then X and Y are chance variables whose values are n-sequences. First we note for later reference the trivial lemmas:

Lemma 2.1.1. We have

P{X is a π-sequence} ≥ 3/4.   (2.1.14)

This follows at once from (2.1.1) and CHEBYSHEV's inequality.

Lemma 2.1.2. Let u_0 be any n-sequence. Then

P{Y is generated by u_0 | X = u_0} ≥ 1 − ε',   (2.1.15)

where ε' ≤ a²/δ² < 1/4, so that ε' → 0 as δ → ∞. This follows at once from (2.1.3) and CHEBYSHEV's inequality.

Let v_0 be any n-sequence generated by any π-sequence u_0. Then, from (2.1.1) and (2.1.3), an upper bound on N(j | v_0) is

Σ_i N(i | u_0) w(j | i) + δ Σ_i [N(i | u_0) w(j | i)]^{1/2}
   ≤ nπ'_j + 2 Σ_i √(a n π_i) w(j | i) + δ Σ_i [(n π_i + 2√(a n π_i)) w(j | i)]^{1/2}
   ≤ nπ'_j + 2a²(1 + δ) √n (π'_j)^{1/4} = v_{j1}.

Similarly

v_{j0} = nπ'_j − 2a²(1 + δ) √n (π'_j)^{1/4}   (2.1.16)

is a lower bound on N(j | v_0). Thus

Π_j (π'_j)^{v_{j1}} ≤ P{Y = v_0} ≤ Π_j (π'_j)^{v_{j0}},   (2.1.17)

from (2.1.8) and the fact that the Y_i's are independently and identically distributed. Since −x log x is bounded for 0 < x < 1, so is −x^{1/4} log x = −4 x^{1/4} log x^{1/4}.
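The distributions π' of (2.1.8) and w' of (2.1.10) are simple to compute from π and w. The sketch below is my own, for the same hypothetical two-letter channel; it also confirms that each row of w' is again a probability vector.

```python
# My own sketch of (2.1.8) and (2.1.10): the output distribution pi' and the
# "backward" c.p.f. w' obtained from pi and w by Bayes' rule.
def forward(pi, w):
    """pi'_j = sum_i pi_i w(j|i); w is a list of rows w[i-1][j-1] = w(j|i)."""
    a = len(pi)
    return [sum(pi[i] * w[i][j] for i in range(a)) for j in range(a)]

def backward(pi, w):
    """w'(j | i) = pi_j w(i | j) / sum_k pi_k w(i | k)  (rows indexed by i)."""
    a, pi_out = len(pi), forward(pi, w)
    return [[pi[j] * w[j][i] / pi_out[i] if pi_out[i] > 0 else 0.0
             for j in range(a)] for i in range(a)]

pi = [0.5, 0.5]
w = [[0.9, 0.1], [0.2, 0.8]]
print(forward(pi, w))        # pi' = [0.55, 0.45]
print(backward(pi, w))       # each row sums to one
```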

Hence

− Σ_i (π'_i)^{1/4} log π'_i

is bounded above by a constant multiple of a. Thus we have proved

Lemma 2.1.3. Let v_0 be any n-sequence generated by any π-sequence. Then

exp2{−nH(π') − K_1 a³(1 + δ)√n} ≤ P{Y = v_0} ≤ exp2{−nH(π') + K_1 a³(1 + δ)√n},   (2.1.18)

where K_1 > 0 does not depend on v_0, a, π, n, δ, or w(· | ·).

From Lemma 2.1.3 we have at once

B_2(w | π) ≤ exp2{nH(π') + K_1 a³(1 + δ)√n}.   (2.1.19)

From Lemmas 2.1.1 and 2.1.2 we have

P{Y is a sequence generated by a π-sequence} ≥ (3/4)(1 − ε') ≥ 9/16.   (2.1.20)

From (2.1.20) and Lemma 2.1.3 we have

B_2(w | π) ≥ (9/16) exp2{nH(π') − K_1 a³(1 + δ)√n}.   (2.1.21)

Thus we have proved

Lemma 2.1.4. We have

exp2{nH(π') − K_2 a³(1 + δ)√n} ≤ B_2(w | π) ≤ exp2{nH(π') + K_2 a³(1 + δ)√n},   (2.1.22)

where K_2 = K_1 − log (9/16) > 0 and does not depend on a, π, n, δ, or w(· | ·).

Let v_0 be any n-sequence generated by the π-sequence u_0. Then

P{Y = v_0 | X = u_0} = Π_{i,j} w(j | i)^{N(i, j | u_0, v_0)}   (2.1.23)

and

exp2{n Σ_{i,j} π_i w(j | i) log w(j | i) + √n (2a + δ) Σ_{i,j} √(w(j | i)) log w(j | i)}
   ≤ P{Y = v_0 | X = u_0}   (2.1.24)
   ≤ exp2{n Σ_{i,j} π_i w(j | i) log w(j | i) − √n (2a + δ) Σ_{i,j} √(w(j | i)) log w(j | i)}.

Thus we have

Lemma 2.1.5. Let u_0 be any πn-sequence, and v_0 be any n-sequence generated by u_0. Then

exp2{−n Σ_i π_i H(w(· | i)) − √n (2a + δ) a² K_3} ≤ P{Y = v_0 | X = u_0} ≤ exp2{−n Σ_i π_i H(w(· | i)) + √n (2a + δ) a² K_3},   (2.1.25)

where K_3 > 0 does not depend on u_0, v_0, a, π, n, δ, or w(· | ·).
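Lemmas 2.1.3 and 2.1.5 above say, roughly, that a generated sequence has probability about exp2{−nH(π')}, and conditional probability, given the π-sequence which generated it, about exp2{−n Σ_i π_i H(w(· | i))}. A quick numerical sketch of my own (hypothetical π and w, with a random draw standing in for a π-sequence, which it is with high probability) shows both normalized exponents settling near these values for large n.

```python
# My own numerical sketch: -(1/n) log2 P{Y = v0} is close to H(pi')     (cf. Lemma 2.1.3),
# and -(1/n) log2 P{Y = v0 | X = u0} is close to sum_i pi_i H(w(.|i))   (cf. Lemma 2.1.5).
import math, random

def H(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

pi = [0.5, 0.5]
w = [[0.9, 0.1], [0.2, 0.8]]
pi_out = [sum(pi[i] * w[i][j] for i in range(2)) for j in range(2)]     # pi'

rng = random.Random(0)
n = 10000
u0 = [1 if rng.random() < pi[0] else 2 for _ in range(n)]               # a pi-sequence (w.h.p.)
v0 = [1 if rng.random() < w[x - 1][0] else 2 for x in u0]               # generated by u0 (w.h.p.)

log2_p_out = sum(math.log2(pi_out[y - 1]) for y in v0)                  # log2 P{Y = v0}
log2_p_cond = sum(math.log2(w[x - 1][y - 1]) for x, y in zip(u0, v0))   # log2 P{Y = v0 | X = u0}
print(-log2_p_out / n, "vs H(pi') =", H(pi_out))
print(-log2_p_cond / n, "vs", sum(pi[i] * H(w[i]) for i in range(2)))
```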

From Lemmas 2.1.2 and 2.1.5 we obtain

Lemma 2.1.6. Let u_0 be any πn-sequence. Then

exp2{n Σ_i π_i H(w(· | i)) − √n (2a + δ) a² K_4} ≤ B_1(w | u_0) ≤ exp2{n Σ_i π_i H(w(· | i)) + √n (2a + δ) a² K_4},   (2.1.26)

where K_4 = K_3 − log (3/4) > 0 and does not depend on u_0, a, π, n, δ, or w(· | ·).

From Lemmas 2.1.4 and 2.1.6 it follows that

Σ_i π_i H(w(· | i)) ≤ H(π').   (2.1.27)

Exactly as in Lemmas 2.1.3 and 2.1.4 one can prove

Lemma 2.1.7. The number B_3(π) of πn-sequences satisfies

exp2{nH(π) − K_5 a^{3/2} √n} ≤ B_3(π) ≤ exp2{nH(π) + K_5 a^{3/2} √n},   (2.1.28)

where K_5 > 0 does not depend on a, π, n, δ, or w(· | ·).

2.2. Properties of the entropy function. If the components of π did not have to sum to unity, H(π) would obviously be a strictly concave function of these components. It remains a strictly concave function of these components in the subspace where they sum to unity. Let W be the (a × a) matrix with element w(j | i) in the j-th row and i-th column. Writing π and π' as column vectors we have π' = Wπ. From this and the definition of concavity we have that H(π') = H(Wπ) is a concave function of the components of π. If W is non-singular the concavity must be strict. The function of π defined by

H(π') − Σ_i π_i H(w(· | i)) = H(Wπ) − Σ_i π_i H(w(· | i))   (2.2.1)

is the sum of a concave function of the components of π (i.e., H(π')) and a linear function of the components of π (i.e., −Σ_i π_i H(w(· | i))), and is therefore a concave function of π. If W is non-singular the function (2.2.1) is strictly concave, and hence there is then a unique value of π at which the function attains its maximum. Even if W is singular the set of points at which the function (2.2.1) attains its maximum is a convex set. We shall always designate an arbitrary member of this set by π̄.

All the chance variables of this section which we shall now employ to obtain properties of H are to be taken, without further statement, as discrete chance variables, each of whose components takes at most a values; without loss of generality the latter will be assumed to be 1, ..., a. Thus X_1 of Section 2.1 is the most general such one-dimensional chance variable, and (X_1, Y_1) is the most general such two-dimensional chance variable. We shall not always indicate the dimensionality (always finite) of a chance variable. Thus Z will in general be a vector chance variable.

If Z is a (not necessarily one-dimensional) chance variable, the chance variable P{Z}, which is a function of Z, will be defined as follows: When Z = z then P{Z} = P{Z = z}. We define

H(Z) = − E log P{Z}.

Hence

H(X_1) = H(π),   H(Y_1) = H(π').

We see that the entropy (function H) of a chance variable is simply the entropy of its distribution. It is nevertheless convenient thus to define H on the space of chance variables, because it makes for conciseness and brevity.

Now let Z = (Z_1, Z_2) be a chance variable (Z_1 and Z_2 may be vectors) and define with probability one (w.p. 1) the (chance variable) function P{Z_2 | Z_1} of Z as follows: When (Z_1, Z_2) = (z_1, z_2) and P{Z_1 = z_1} > 0, then P{Z_2 | Z_1} = P{Z_2 = z_2 | Z_1 = z_1}. Also define

H(Z_2 | Z_1) = − E log P{Z_2 | Z_1}.

Thus

H(Y_1 | X_1) = Σ_i π_i H(w(· | i)),
H(X_1 | Y_1) = Σ_i π'_i H(w'(· | i)).

Finally, define the chance variable P{Z_2 | Z_1 = z_1}, when P{Z_1 = z_1} > 0, in the obvious manner as follows: When Z_2 = z_2 then P{Z_2 | Z_1 = z_1} = P{Z_2 = z_2 | Z_1 = z_1}. As before, we write

H(Z_2 | Z_1 = z_1) = − E log P{Z_2 | Z_1 = z_1}.

Thus

H(Z_2 | Z_1) = Σ_i H(Z_2 | Z_1 = i) P{Z_1 = i}.

Since, w.p. 1,

log P{Z_1} + log P{Z_2 | Z_1} = log P{Z_1, Z_2} = log P{Z_2} + log P{Z_1 | Z_2},

we have

H(Z_1, Z_2) = H(Z_1) + H(Z_2 | Z_1) = H(Z_2) + H(Z_1 | Z_2).   (2.2.2)

In particular

H(π) − Σ_i π'_i H(w'(· | i)) = H(π') − Σ_i π_i H(w(· | i)).   (2.2.3)
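Identity (2.2.2), applied to the pair (X_1, Y_1), can be verified numerically. The sketch below is my own, using the same hypothetical π and w as earlier; it computes H(X_1, Y_1) directly from the joint distribution and compares it with H(X_1) + H(Y_1 | X_1) and with H(Y_1) + H(X_1 | Y_1), each obtained independently.

```python
# My own numerical check of (2.2.2) for the pair (X_1, Y_1) with joint
# probabilities pi_i * w(j|i).
import math

def H(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

pi = [0.5, 0.5]
w = [[0.9, 0.1], [0.2, 0.8]]
joint = [pi[i] * w[i][j] for i in range(2) for j in range(2)]
pi_out = [sum(pi[i] * w[i][j] for i in range(2)) for j in range(2)]
w_back = [[pi[j] * w[j][i] / pi_out[i] for j in range(2)] for i in range(2)]

H_XY = H(joint)
H_Y_given_X = sum(pi[i] * H(w[i]) for i in range(2))            # sum_i pi_i H(w(.|i))
H_X_given_Y = sum(pi_out[i] * H(w_back[i]) for i in range(2))   # sum_i pi'_i H(w'(.|i))
print(round(H_XY, 6),
      round(H(pi) + H_Y_given_X, 6),
      round(H(pi_out) + H_X_given_Y, 6))                        # all three agree
```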

An elementary argument with a Lagrange multiplier shows that H(π) ≤ log a, with a unique maximum at

π = (1/a, ..., 1/a).   (2.2.4)

We now find the maximum of H(π), subject to the constraint π_1 ≤ α'. Fix π_1. The same argument with a Lagrange multiplier gives

H(π) ≤ − π_1 log π_1 − (1 − π_1) log [(1 − π_1)/(a − 1)],

with a unique maximum at

π = (π_1, (1 − π_1)/(a − 1), ..., (1 − π_1)/(a − 1)).   (2.2.5)

Now vary π_1, subject to π_1 ≤ α'. We obtain: if α' ≤ 1/a,

H(π) ≤ − α' log α' − (1 − α') log (1 − α') + (1 − α') log (a − 1),   (2.2.6)

with a unique maximum at

π = (α', (1 − α')/(a − 1), ..., (1 − α')/(a − 1));   (2.2.7)

if α' ≥ 1/a, then (2.2.4) holds.

Now maximize

H({π_ij}) = − Σ_{i,j} π_ij log π_ij,

subject to the constraints

Σ_j π_ij = π_i,   i = 1, ..., a,
Σ_i π_ij = π'_j,   j = 1, ..., a.

An elementary argument with Lagrange multipliers shows that there is a unique maximum when

π_ij = π_i π'_j,   i, j = 1, ..., a.

Hence we have that

H(X_1, Y_1) ≤ H(X_1) + H(Y_1)   (2.2.8)

with equality when and only when the following equivalent conditions hold:

X_1 and Y_1 are independent.   (2.2.9)

w(· | i) is the same for all i such that π_i > 0.   (2.2.10)

w'(· | i) is the same for all i such that π'_i > 0.   (2.2.11)
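A last numerical sketch of my own: the gap H(X_1) + H(Y_1) − H(X_1, Y_1) in (2.2.8) is positive for the hypothetical channel used throughout these examples, and it vanishes, up to rounding, when all rows w(· | i) coincide, which is condition (2.2.10).

```python
# My own check of (2.2.8)-(2.2.10): H(X_1, Y_1) <= H(X_1) + H(Y_1), with
# equality exactly when the rows w(.|i) coincide (X_1 and Y_1 independent).
import math

def H(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def gap(pi, w):
    """H(X) + H(Y) - H(X, Y), all in bits."""
    a = len(pi)
    joint = [pi[i] * w[i][j] for i in range(a) for j in range(a)]
    pi_out = [sum(pi[i] * w[i][j] for i in range(a)) for j in range(a)]
    return H(pi) + H(pi_out) - H(joint)

pi = [0.5, 0.5]
print(gap(pi, [[0.9, 0.1], [0.2, 0.8]]))   # > 0: dependent
print(gap(pi, [[0.7, 0.3], [0.7, 0.3]]))   # ~ 0: identical rows, independent
```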


More information

to mere bit flips) may affect the transmission.

to mere bit flips) may affect the transmission. 5 VII. QUANTUM INFORMATION THEORY to mere bit flips) may affect the transmission. A. Introduction B. A few bits of classical information theory Information theory has developed over the past five or six

More information

Binary Convolutional Codes of High Rate Øyvind Ytrehus

Binary Convolutional Codes of High Rate Øyvind Ytrehus Binary Convolutional Codes of High Rate Øyvind Ytrehus Abstract The function N(r; ; d free ), defined as the maximum n such that there exists a binary convolutional code of block length n, dimension n

More information

The Mathematics of Computerized Tomography

The Mathematics of Computerized Tomography The Mathematics of Computerized Tomography The Mathematics of Computerized Tomography F. Natterer University of Münster Federal Republic of Germany B. G. TEUBNER Stuttgart @) JOHN WILEY & SONS Chichester.

More information

PROCESSING AND TRANSMISSION OF INFORMATION*

PROCESSING AND TRANSMISSION OF INFORMATION* IX. PROCESSING AND TRANSMISSION OF INFORMATION* Prof. P. Elias E. Ferretti E. T. Schoen Prof. R. M. Fano K. Joannou F. F. Tung Prof. D. A. Huffman R. M. Lerner S. H. Unger Prof. C. E. Shannon J. B. O'Loughlin

More information

Lecture 6: Expander Codes

Lecture 6: Expander Codes CS369E: Expanders May 2 & 9, 2005 Lecturer: Prahladh Harsha Lecture 6: Expander Codes Scribe: Hovav Shacham In today s lecture, we will discuss the application of expander graphs to error-correcting codes.

More information

Variable Length Codes for Degraded Broadcast Channels

Variable Length Codes for Degraded Broadcast Channels Variable Length Codes for Degraded Broadcast Channels Stéphane Musy School of Computer and Communication Sciences, EPFL CH-1015 Lausanne, Switzerland Email: stephane.musy@ep.ch Abstract This paper investigates

More information

DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition

DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition R. G. Gallager January 31, 2011 i ii Preface These notes are a draft of a major rewrite of a text [9] of the same name. The notes and the text are outgrowths

More information

Introduction to the Theory and Application of the Laplace Transformation

Introduction to the Theory and Application of the Laplace Transformation Gustav Doetsch Introduction to the Theory and Application of the Laplace Transformation With 51 Figures and a Table of Laplace Transforms Translation by Walter Nader Springer-Verlag Berlin Heidelberg New

More information

QUANTUM SCATTERING THEORY FOR SEVERAL PARTICLE SYSTEMS

QUANTUM SCATTERING THEORY FOR SEVERAL PARTICLE SYSTEMS .: ' :,. QUANTUM SCATTERING THEORY FOR SEVERAL PARTICLE SYSTEMS Mathematical Physics and Applied Mathematics Editors: M. Plato, Universite de Bourgogne, Dijon, France The titles published in this series

More information

Linear Algebra. The analysis of many models in the social sciences reduces to the study of systems of equations.

Linear Algebra. The analysis of many models in the social sciences reduces to the study of systems of equations. POLI 7 - Mathematical and Statistical Foundations Prof S Saiegh Fall Lecture Notes - Class 4 October 4, Linear Algebra The analysis of many models in the social sciences reduces to the study of systems

More information

AN INTRODUCTION TO SECRECY CAPACITY. 1. Overview

AN INTRODUCTION TO SECRECY CAPACITY. 1. Overview AN INTRODUCTION TO SECRECY CAPACITY BRIAN DUNN. Overview This paper introduces the reader to several information theoretic aspects of covert communications. In particular, it discusses fundamental limits

More information

Codes for Partially Stuck-at Memory Cells

Codes for Partially Stuck-at Memory Cells 1 Codes for Partially Stuck-at Memory Cells Antonia Wachter-Zeh and Eitan Yaakobi Department of Computer Science Technion Israel Institute of Technology, Haifa, Israel Email: {antonia, yaakobi@cs.technion.ac.il

More information

Capacity of a Two-way Function Multicast Channel

Capacity of a Two-way Function Multicast Channel Capacity of a Two-way Function Multicast Channel 1 Seiyun Shin, Student Member, IEEE and Changho Suh, Member, IEEE Abstract We explore the role of interaction for the problem of reliable computation over

More information

Information Theory in Intelligent Decision Making

Information Theory in Intelligent Decision Making Information Theory in Intelligent Decision Making Adaptive Systems and Algorithms Research Groups School of Computer Science University of Hertfordshire, United Kingdom June 7, 2015 Information Theory

More information

EE376A - Information Theory Final, Monday March 14th 2016 Solutions. Please start answering each question on a new page of the answer booklet.

EE376A - Information Theory Final, Monday March 14th 2016 Solutions. Please start answering each question on a new page of the answer booklet. EE376A - Information Theory Final, Monday March 14th 216 Solutions Instructions: You have three hours, 3.3PM - 6.3PM The exam has 4 questions, totaling 12 points. Please start answering each question on

More information

Writing Mathematical Proofs

Writing Mathematical Proofs Writing Mathematical Proofs Dr. Steffi Zegowitz The main resources for this course are the two following books: Mathematical Proofs by Chartrand, Polimeni, and Zhang How to Think Like a Mathematician by

More information

Lecture 5 Channel Coding over Continuous Channels

Lecture 5 Channel Coding over Continuous Channels Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From

More information

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking

More information

Capacity of a channel Shannon s second theorem. Information Theory 1/33

Capacity of a channel Shannon s second theorem. Information Theory 1/33 Capacity of a channel Shannon s second theorem Information Theory 1/33 Outline 1. Memoryless channels, examples ; 2. Capacity ; 3. Symmetric channels ; 4. Channel Coding ; 5. Shannon s second theorem,

More information

Introduction to algebraic codings Lecture Notes for MTH 416 Fall Ulrich Meierfrankenfeld

Introduction to algebraic codings Lecture Notes for MTH 416 Fall Ulrich Meierfrankenfeld Introduction to algebraic codings Lecture Notes for MTH 416 Fall 2014 Ulrich Meierfrankenfeld December 9, 2014 2 Preface These are the Lecture Notes for the class MTH 416 in Fall 2014 at Michigan State

More information

Lecture 6: Introducing Complexity

Lecture 6: Introducing Complexity COMP26120: Algorithms and Imperative Programming Lecture 6: Introducing Complexity Ian Pratt-Hartmann Room KB2.38: email: ipratt@cs.man.ac.uk 2015 16 You need this book: Make sure you use the up-to-date

More information