Introduction to Information Theory By Prof. S.J. Soni Asst. Professor, CE Department, SPCE, Visnagar

Introduction [B.P. Lathi] No practical means of communication is completely error-free. We may be able to improve the accuracy of digital signals by reducing the error probability P_e. In digital systems, P_e varies asymptotically as e^(-k*E_b). By increasing E_b, the energy per bit, we can reduce P_e to any desired level.

Introduction Now the signal power is S_i = E_b * R_b, where R_b is the bit rate. Hence, increasing E_b means either increasing the signal power S_i (for a given bit rate), decreasing the bit rate R_b (for a given power), or both. Because of physical limitations, however, S_i cannot be increased beyond some limit. Hence, to reduce P_e further, we must reduce R_b, the rate of transmission of information digits.

Introduction In the presence of noise, it appeared that error-free communication was impossible: P_e -> 0 only if R_b -> 0. In 1948, Shannon published a paper titled "A Mathematical Theory of Communication", which showed that P_e -> 0 can be achieved as long as R_b < C (the channel capacity); that is, we can still have error-free transmission at a nonzero rate.

Introduction Disturbances on a communication channel do not limit the accuracy of transmission; what they limit is the rate at which information can be transmitted. Information theory is a mathematical science. The word "information" is very deceptive.

Information 1. John was dropped off at the airport by a taxi. 2. The taxi brought John to the airport. 3. There is a traffic jam on highway NH5, between Mumbai and Pune, in India. 4. There is a traffic jam on highway NH5 in India.

Information Syntactic Information: It is related to the symbols we use to build up our message. Sentences 1 & 2 carry the same information, but they differ syntactically. Semantic Information: The meaning of the message. Sentences 3 & 4 are syntactically different, and they also differ semantically.

Information Pragmatic Information: It is related to the effect and usage of the message. Consider sentences 3 & 4: for people outside the country, neither sentence is of much importance.

Syntactic Information For two sentences we may use different symbols, but the ultimate meaning remains the same. Block diagram: Source -> Source Encoder -> Channel Encoder -> Channel -> Channel Decoder -> Source Decoder -> Destination. What Shannon did was show the way to obtain an optimal source encoder and channel encoder.

Example: Binary Data. Consider A, B and C to be three cities; B and C each report their weather status to A using binary data.

    Link B -> A (weather status)        Link C -> A (weather status)
    Sunny  - 00  (prob. 1/4)            Sunny  - 1110 (prob. 1/8)
    Rainy  - 01  (prob. 1/4)            Rainy  - 110  (prob. 1/8)
    Cloudy - 10  (prob. 1/4)            Cloudy - 10   (prob. 1/4)
    Foggy  - 11  (prob. 1/4)            Smoggy - 0    (prob. 1/2)

Example What is the difference between the communication link from B to A and the one from C to A? The cost of operating the communication, i.e., the average number of bits per message (per second).

Example From C to A: L_average = 4 x 1/8 + 3 x 1/8 + 2 x 1/4 + 1 x 1/2 = 1 7/8 binits (binary digits)/message. From B to A: L_average = 2 x 1/4 + 2 x 1/4 + 2 x 1/4 + 2 x 1/4 = 2 binits/message [more than 1 7/8].
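
To make the comparison concrete, here is a minimal Python sketch (the probabilities and codeword lengths are taken from the example above) that computes the average number of binary digits per message for each link:

    # Average codeword length for the two weather links (values from the example above)
    b_to_a = [(1/4, 2), (1/4, 2), (1/4, 2), (1/4, 2)]   # fixed 2-digit codes
    c_to_a = [(1/8, 4), (1/8, 3), (1/4, 2), (1/2, 1)]   # variable-length codes

    def average_length(code):
        # code: list of (probability, codeword length in binary digits)
        return sum(p * l for p, l in code)

    print(average_length(b_to_a))   # 2.0 binits/message
    print(average_length(c_to_a))   # 1.875 binits/message (= 1 7/8)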

Example Is it possible to find a mapping that is better than this? If it is possible, how low can we go? And if we can go lower, how do we synthesize such a mapping, i.e., such a code (e.g., 01, 11, 111, etc.)?

Measure of Information [B.P. Lathi] Commonsense Measure of Information: Consider the following headlines in a morning paper. 1. There will be daylight tomorrow. 2. China invades India. 3. India invades China. The reader's interest, and the amount of information conveyed, depend on the probabilities of occurrence of these events.

Measure of Information Commonsense Measure of Information: Information is connected with the element of surprise, which is a result of uncertainty or unexpectedness. If P is the probability of occurrence of a message and I is the information gained from the message, it is evident from the preceding discussion that as P -> 1, I -> 0, and as P -> 0, I -> infinity; in general, a smaller P gives a larger I. So, I ~ log(1/P).

Measure of Information Engineering Measure of Information: From an engineering point of view, the amount of information in a message is proportional to the (minimum) time required to transmit the message. This implies that a message with higher probability can be transmitted in a shorter time than a message with lower probability. This fact may be verified from the three-city example, e.g. P = 1/8 -> 4 bits, P = 1/2 -> 1 bit.

Measure of Information Engineering Measure of Information: Let us assume that for two equiprobable messages m1 and m2, we use the binary digits 0 and 1 respectively. For four equiprobable messages m1, m2, m3 and m4, we may use 00, 01, 10, 11. For eight equiprobable messages m1, ..., m8, we may use 000, 001, ..., 111.

Measure of Information Engineering Measure of Information: In general, we need log_2(n) binary digits to encode each of n equiprobable messages. The probability P of any one message occurring is 1/n. Hence, to encode each message (with probability P), we need log_2(1/P) binary digits. The information I contained in a message with probability of occurrence P is therefore proportional to log_2(1/P): I = k log_2(1/P); taking k = 1, I = log_2(1/P) bits.

Average Information per Message: Entropy of a Source [B.P. Lathi] Consider a memoryless source m emitting messages m_1, m_2, ..., m_n with probabilities P_1, P_2, ..., P_n respectively (P_1 + P_2 + ... + P_n = 1). A memoryless source implies that each message emitted is independent of the previous message(s). The information content of message m_i is I_i = log_2(1/P_i) bits.

Average Information per Message: Entropy of a Source Hence, the mean, or average, information per message emitted by the source is sum_{i=1}^{n} P_i I_i bits. The average information per message of a source m is called its entropy, denoted H(m): H(m) = sum_{i=1}^{n} P_i I_i = sum_{i=1}^{n} P_i log_2(1/P_i) = - sum_{i=1}^{n} P_i log_2(P_i) bits.
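
As a quick check, here is a small Python sketch of this formula (base-2 logarithm; the probabilities are assumed to sum to 1):

    from math import log2

    def entropy(probabilities):
        # H(m) = sum of P_i * log2(1 / P_i); zero-probability messages contribute nothing
        return sum(p * log2(1 / p) for p in probabilities if p > 0)

    print(entropy([1/4, 1/4, 1/4, 1/4]))   # 2.0 bits/message (B-to-A link)
    print(entropy([1/8, 1/8, 1/4, 1/2]))   # 1.75 bits/message (C-to-A link)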

Source Encoding Huffman Code The source encoding theorem says that to encode a source with entropy H(m), we need, on average, a minimum of H(m) binary digits per message. The number of digits in a codeword is the length of the codeword. Thus, the average word length of an optimal code is at least H(m).

Huffman Code Example Consider six messages with probabilities 0.30, 0.25, 0.15, 0.12, 0.08 and 0.10 respectively. At each reduction the two least probable messages are combined:

    Messages   Probabilities   S1      S2      S3      S4
    m1         0.30            0.30    0.30    0.43    0.57
    m2         0.25            0.25    0.27    0.30    0.43
    m3         0.15            0.18    0.25    0.27
    m4         0.12            0.15    0.18
    m5         0.08            0.12
    m6         0.10

Huffman Code Example

    Messages   Prob.   Code    S1      Code   S2      Code   S3      Code   S4      Code
    m1         0.30    00      0.30    00     0.30    00     0.43    1      0.57    0
    m2         0.25    10      0.25    10     0.27    01     0.30    00     0.43    1
    m3         0.15    010     0.18    11     0.25    10     0.27    01
    m4         0.12    011     0.15    010    0.18    11
    m5         0.08    110     0.12    011
    m6         0.10    111

The optimum (Huffman) code obtained this way is called a compact code. The average length of the compact code is L = sum_{i=1}^{n} P_i L_i = 0.3(2) + 0.25(2) + 0.15(3) + 0.12(3) + 0.1(3) + 0.08(3) = 2.45 binary digits. The entropy of the source is H(m) = sum_{i=1}^{n} P_i log_2(1/P_i) ≈ 2.42 bits. Hence, the minimum possible average length is about 2.42 binary digits per message, and the Huffman code attains an average length of 2.45 in this example.

Huffman Code Example (continued) From the table above, L = 2.45 binary digits and H(m) ≈ 2.42 bits. Code efficiency: η = H(m) / L ≈ 2.42 / 2.45 ≈ 0.99. Redundancy: γ = 1 - η ≈ 0.01.
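
A minimal Python sketch of binary Huffman coding for this example (built with a heap; tie-breaking may produce a different but equally optimal set of codewords than the table above, with the same average length):

    import heapq
    from math import log2

    def huffman_codes(probs):
        # Each heap entry: (probability, unique id, {message index: partial codeword})
        heap = [(p, i, {i: ''}) for i, p in enumerate(probs)]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)    # two least probable nodes
            p2, i2, c2 = heapq.heappop(heap)
            merged = {k: '0' + v for k, v in c1.items()}
            merged.update({k: '1' + v for k, v in c2.items()})
            heapq.heappush(heap, (p1 + p2, i2, merged))
        return heap[0][2]

    probs = [0.30, 0.25, 0.15, 0.12, 0.08, 0.10]
    codes = huffman_codes(probs)
    L = sum(p * len(codes[i]) for i, p in enumerate(probs))
    H = sum(p * log2(1 / p) for p in probs)
    print(codes)          # message index -> codeword, lengths 2, 2, 3, 3, 3, 3
    print(L, H, H / L)    # ~2.45  ~2.42  ~0.99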

Huffman Code Even though the Huffman code is a variable-length code, it is uniquely decodable: a received sequence of Huffman-coded messages can be decoded in only one way, that is, without ambiguity. For example, the sequence m1 m5 m2 m1 m4 m3 m6 would be encoded as 001101000011010111, and we can verify that this string can be decoded in only one way.

Example A memoryless source emits six messages with probabilities 0.3, 0.25, 0.15, 0.12, 0.1, and 0.08. Find the 4-ary (quaternary) Huffman code. Determine its average word length, the efficiency, and the redundancy. 1. The number of messages must equal r + k(r-1) = 4 + 1(4-1) = 7, so one dummy message of probability 0 is added. 2. Create the reduction table so that the last column contains 4 messages. 3. Calculate L = 1.3 4-ary digits. 4. H_4(m) = - sum_{i=1}^{6} P_i log_4(P_i) ≈ 1.21 4-ary units. 5. Code efficiency ≈ 0.93. 6. Redundancy ≈ 0.07.
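
The same numbers can be reproduced with a short Python sketch. The helper rary_huffman_lengths below is illustrative, not a library routine; it only computes codeword lengths, appending dummy zero-probability messages as in step 1:

    import heapq
    from math import log2

    # Sketch: codeword lengths of an r-ary Huffman code.
    def rary_huffman_lengths(probs, r=2):
        probs = list(probs)
        while len(probs) > 1 and (len(probs) - 1) % (r - 1) != 0:
            probs.append(0.0)                  # dummy message of probability 0
        heap = [(p, i, [i]) for i, p in enumerate(probs)]
        heapq.heapify(heap)
        lengths = [0] * len(probs)
        uid = len(probs)
        while len(heap) > 1:
            group_p, group_leaves = 0.0, []
            for _ in range(r):                 # merge the r least probable nodes
                p, _, leaves = heapq.heappop(heap)
                group_p += p
                group_leaves += leaves
            for leaf in group_leaves:
                lengths[leaf] += 1             # each merge adds one r-ary digit
            heapq.heappush(heap, (group_p, uid, group_leaves))
            uid += 1
        return lengths

    probs = [0.3, 0.25, 0.15, 0.12, 0.1, 0.08]
    lengths = rary_huffman_lengths(probs, r=4)
    L = sum(p * l for p, l in zip(probs, lengths))        # dummies have p = 0
    H4 = sum(p * log2(1 / p) for p in probs) / log2(4)    # entropy in 4-ary units
    print(L, H4, H4 / L)                                  # ~1.3  ~1.21  ~0.93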

Other GTU Examples Apply the Huffman coding method to the following message ensemble: [X] = [X1 X2 X3 X4 X5 X6 X7], [P] = [0.4 0.2 0.12 0.08 0.08 0.08 0.04]. Take M = 2. Calculate: (i) Entropy (ii) Average length (iii) Efficiency. Define entropy and its unit. Explain the Huffman coding technique in detail. Explain how uncertainty and information are related and how the entropy of a discrete source is determined. Find a quaternary compact code for a source emitting symbols s1, s2, ..., s11 with the corresponding probabilities 0.21, 0.16, 0.12, 0.10, 0.10, 0.07, 0.07, 0.06, 0.05, 0.05, and 0.01, respectively.

Tutorial Problems Book: Modern Digital and Analog Communication Systems by B.P. Lathi. Chapter 13: Introduction to Information Theory. Exercise problems 13.1-1, 13.2-1, 13.2-2, 13.2-3, 13.2-4, 13.2-5, 13.2-6.

Coding [Khalid Sayood] Coding is the assignment of binary sequences to elements of an alphabet. The set of binary sequences is called a code, and the individual members of the set are called codewords. An alphabet is a collection of symbols called letters, e.g., the alphabet used in writing books. The ASCII code assigns a codeword of the same length to each letter; this is called fixed-length coding.

Statistical Methods [David Salomon] Statistical methods use variable-size codes, with the shorter codes assigned to symbols or groups of symbols that appear more often in the data (have a higher probability of occurrence). Designers and implementers of variable-size codes have to deal with two problems: (1) assigning codes that can be decoded unambiguously, and (2) assigning codes with the minimum average size.

Uniquely Decodable Codes [K. Sayood] The average length of the code is not the only important point in designing a good code. Consider the following example. Suppose the source alphabet consists of four letters a1, a2, a3 and a4, with probabilities P(a1) = 1/2, P(a2) = 1/4, P(a3) = P(a4) = 1/8. The entropy of this source is 1.75 bits/symbol.

    Letters          Probability   Code 1   Code 2   Code 3   Code 4
    a1               0.5           0        0        0        0
    a2               0.25          0        1        10       01
    a3               0.125         1        00       110      011
    a4               0.125         10       11       111      0111
    Average length                 1.125    1.25     1.75     1.875

Uniquely Decodable Codes Based on average length alone, Code 1 appears to be the best code. However, it is ambiguous, because a1 and a2 are assigned the same codeword 0. In Code 2, each symbol is assigned a distinct codeword, but it is still ambiguous: to send a2 a1 a1 we transmit 100, which the decoder can decode as a2 a1 a1 or as a2 a3. Code 3 and Code 4 are both uniquely decodable (unambiguous). With Code 3, the decoder knows the moment a codeword is complete; with Code 4, we have to wait for the beginning of the next codeword before we know that the current codeword is complete. Because of this, Code 3 is called an instantaneous code, while Code 4 is not an instantaneous code.

Sardinas-Patterson Theorem A code is uniquely decodable if and only if no dangling suffix produced by the Sardinas-Patterson test is itself a codeword. The test repeatedly forms dangling suffixes: whenever a codeword (or a previously obtained dangling suffix) is a prefix of another codeword (or suffix), the remaining part is added to the set of dangling suffixes.

Example Consider the code {a = 1, b = 011, c = 01110, d = 1110, e = 10011}. This code is an example of a code that is not uniquely decodable, since the string 011101110011 can be interpreted as the sequence of codewords 01110 1110 011, but also as the sequence of codewords 011 1 011 10011. Two possible decodings of this encoded string are thus given by cdb and babe.

Example Here the codeword set is {1, 011, 01110, 1110, 10011}. 1 is a prefix of 1110 (dangling suffix 110), giving {1, 011, 01110, 1110, 10011, 110}. 1 is also a prefix of 10011 (dangling suffix 0011), giving {1, 011, 01110, 1110, 10011, 110, 0011}. 011 is a prefix of 01110 (dangling suffix 10), giving {1, 011, 01110, 1110, 10011, 110, 0011, 10}. Now the dangling suffix 10 is itself a prefix of the codeword 10011, producing a new dangling suffix 011, which is already the codeword for b. Hence the given code is not uniquely decodable.

Other Examples [K. Sayood] Consider the code {0, 01, 11} and prove that it is uniquely decodable. Consider the code {0, 01, 10} and prove that it is not uniquely decodable. (A small test sketch follows.)
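
Here is a minimal Python sketch of the Sardinas-Patterson test described above; it mechanically reports whether any dangling suffix is itself a codeword (the written proofs are left as the exercise asks):

    def is_uniquely_decodable(code):
        # Sardinas-Patterson test: grow the set of dangling suffixes; the code is
        # uniquely decodable iff no dangling suffix is itself a codeword.
        code = set(code)
        suffixes = set()
        for a in code:                          # initial dangling suffixes:
            for b in code:                      # one codeword is a prefix of another
                if a != b and b.startswith(a):
                    suffixes.add(b[len(a):])
        while True:
            new = set(suffixes)
            for s in suffixes:
                for c in code:
                    if c.startswith(s):
                        new.add(c[len(s):])     # suffix is a prefix of a codeword
                    if s.startswith(c):
                        new.add(s[len(c):])     # codeword is a prefix of a suffix
            new.discard('')
            if new & code:
                return False                    # a dangling suffix is a codeword
            if new == suffixes:
                return True                     # no new suffixes: uniquely decodable
            suffixes = new

    print(is_uniquely_decodable({'1', '011', '01110', '1110', '10011'}))  # False
    print(is_uniquely_decodable({'0', '01', '11'}))                       # True
    print(is_uniquely_decodable({'0', '01', '10'}))                       # False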

GTU Examples What is a uniquely decodable code? Check which of the following codes are uniquely decodable and instantaneous: S1 = {101, 11, 00, 01, 100}, S2 = {0, 10, 110, 1110, ...}, S3 = {02, 12, 20, 21, 120}. What is the need for instantaneous codes? Explain instantaneous codes in detail with a suitable example.

Kraft-McMillan Inequality [Khalid Sayood] We divide this topic into two parts. The first part provides a necessary condition on the codeword lengths of uniquely decodable codes. The second part shows that we can always find a prefix code whose codeword lengths satisfy this necessary condition.

Kraft-McMillan Inequality Let C be a code with N codewords of lengths l_1, l_2, ..., l_N. If C is uniquely decodable, then K(C) = sum_{i=1}^{N} 2^(-l_i) <= 1.
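
A one-line check of this inequality in Python (taking the codeword lengths as input):

    def kraft_sum(lengths, r=2):
        # K(C) = sum over i of r^(-l_i); it must be <= 1 for a uniquely decodable code
        return sum(r ** -l for l in lengths)

    print(kraft_sum([2, 2, 3, 3, 3, 3]))   # 1.0    (the binary Huffman code above)
    print(kraft_sum([1, 2, 3, 4]))         # 0.9375 (Code 4: 0, 01, 011, 0111)
    print(kraft_sum([1, 1, 2, 2]))         # 1.5 > 1: Code 2 cannot be uniquely decodable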

Shannon-Fano Coding [D. Salomon]

    Symbol   Prob.   Steps      Final
    1        0.25    1 1        11
    2        0.20    1 0        10
    3        0.15    0 1 1      011
    4        0.15    0 1 0      010
    5        0.10    0 0 1      001
    6        0.10    0 0 0 1    0001
    7        0.05    0 0 0 0    0000

The average size of this code is 0.25x2 + 0.2x2 + 0.15x3 + 0.15x3 + 0.1x3 + 0.1x4 + 0.05x4 = 2.7 bits/symbol. The entropy is close to 2.67 bits/symbol, so the result is good.

Shannon-Fano Coding [D. Salomon] Exercise: repeat the calculation above, but place the first split between the third and fourth symbols. Calculate the average size of the resulting code and show that it is greater than 2.67 bits/symbol. This suggests that the Shannon-Fano method produces a better code when the splits are chosen better.
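
For reference, here is a minimal recursive Shannon-Fano sketch in Python. It assumes the symbols are already sorted by decreasing probability, splits each group where the two halves' total probabilities are as balanced as possible, and gives the upper half the bit 1 (matching the table above); changing the split point lets you check the exercise.

    from math import log2

    def shannon_fano(symbols, prefix=''):
        # symbols: list of (name, probability), sorted by decreasing probability
        if len(symbols) == 1:
            return {symbols[0][0]: prefix or '0'}
        total = sum(p for _, p in symbols)
        best_split, best_diff, running = 1, float('inf'), 0.0
        for i in range(1, len(symbols)):
            running += symbols[i - 1][1]
            diff = abs(running - (total - running))
            if diff < best_diff:               # most balanced split point
                best_diff, best_split = diff, i
        codes = {}
        codes.update(shannon_fano(symbols[:best_split], prefix + '1'))
        codes.update(shannon_fano(symbols[best_split:], prefix + '0'))
        return codes

    probs = [0.25, 0.20, 0.15, 0.15, 0.10, 0.10, 0.05]
    symbols = [(str(i + 1), p) for i, p in enumerate(probs)]
    codes = shannon_fano(symbols)
    avg = sum(p * len(codes[name]) for name, p in symbols)
    H = -sum(p * log2(p) for p in probs)
    print(codes)       # {'1': '11', '2': '10', '3': '011', ...}
    print(avg, H)      # ~2.7  ~2.67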

GTU Examples Apply the Shannon-Fano coding algorithm to the given message ensemble: [X] = [X1 X2 X3 X4 X5 X6], P[X] = [0.3 0.25 0.2 0.12 0.08 0.05]. Take M = 2. Find: (i) Entropy (ii) Codewords (iii) Average length (iv) Efficiency. Explain the Shannon-Fano code in detail with an example. Mention its advantages over other coding schemes.

Arithmetic Coding [David Salomon] If a statistical method assigns a 90% probability to a given character, the optimal code size would be 0.15 bits, while a Huffman coder would probably assign a 1-bit code to the symbol, which is about six times longer than necessary. Arithmetic coding bypasses the idea of replacing each input symbol with a specific code; it replaces a stream of input symbols with a single floating-point output number.

Suppose that we want to encode the message BILL GATES. Each character is assigned a probability and a sub-range of [0, 1):

    Character    Probability   Range
    ^ (space)    1/10          0.00 - 0.10
    A            1/10          0.10 - 0.20
    B            1/10          0.20 - 0.30
    E            1/10          0.30 - 0.40
    G            1/10          0.40 - 0.50
    I            1/10          0.50 - 0.60
    L            2/10          0.60 - 0.80
    S            1/10          0.80 - 0.90
    T            1/10          0.90 - 1.00

Arithmetic Coding Encoding algorithm for arithmetic coding:

    low = 0.0; high = 1.0;
    while not EOF do
        range = high - low;
        read(c);
        high = low + range * high_range(c);
        low  = low + range * low_range(c);
    end do
    output(low);

Arithmetic Coding To encode the first character B properly, the final coded message has to be a number greater than or equal to 0.20 and less than 0.30. range = 1.0 - 0.0 = 1.0; high = 0.0 + 1.0 x 0.3 = 0.3; low = 0.0 + 1.0 x 0.2 = 0.2. After the first character is encoded, the low end of the range changes from 0.00 to 0.20 and the high end changes from 1.00 to 0.30.

Arithmetic Coding The next character to be encoded, the letter I, owns the range 0.50 to 0.60, taken within the new sub-range 0.20 to 0.30. So the new encoded number will fall somewhere in the 50th to 60th percentile of the currently established range. Thus, this number is further restricted to the interval 0.25 to 0.26.

Arithmetic Coding Note that any number between 0.25 and 0.26 is a legal encoding of "BI". Thus, a number that is best suited for binary representation is selected. (Condition: the length of the encoded message is known at the decoder, or an EOF symbol is used.)

Arithmetic Coding

    Character   Symbol range   Current low / high / range   New high                    New low
    B           0.2 - 0.3      0 / 1 / 1                    0 + 1 x 0.3 = 0.3           0 + 1 x 0.2 = 0.2
    I           0.5 - 0.6      0.2 / 0.3 / 0.1              0.2 + 0.1 x 0.6 = 0.26      0.2 + 0.1 x 0.5 = 0.25
    L           0.6 - 0.8      0.25 / 0.26 / 0.01           0.25 + 0.01 x 0.8 = 0.258   0.25 + 0.01 x 0.6 = 0.256

Figure: successive subdivision of the interval [0, 1) while encoding BILL GATES. Each symbol narrows the working interval: [0, 1) -> [0.2, 0.3) -> [0.25, 0.26) -> [0.256, 0.258) -> [0.2572, 0.2576) -> [0.2572, 0.25724) -> [0.257216, 0.25722) -> [0.2572164, 0.2572168) -> [0.25721676, 0.2572168) -> [0.257216772, 0.257216776) -> [0.2572167752, 0.2572167756).

Arithmetic Coding

    Character    Low            High
    B            0.2            0.3
    I            0.25           0.26
    L            0.256          0.258
    L            0.2572         0.2576
    ^ (space)    0.25720        0.25724
    G            0.257216       0.257220
    A            0.2572164      0.2572168
    T            0.25721676     0.2572168
    E            0.257216772    0.257216776
    S            0.2572167752   0.2572167756

Arithmetic Coding So the final value 0.2572167752 (or any value between 0.2572167752 and 0.2572167756, if the length of the encoded message is known at the decoding end) will uniquely encode the message BILL GATES.
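
A minimal Python sketch of this encoder, using exact fractions so that the numbers above are reproduced without floating-point rounding (the symbol ranges are the ones assumed in the table at the start of this example):

    from fractions import Fraction as F

    # Symbol ranges assumed in the BILL GATES example above
    RANGES = {
        ' ': (F(0, 10), F(1, 10)),  'A': (F(1, 10), F(2, 10)),
        'B': (F(2, 10), F(3, 10)),  'E': (F(3, 10), F(4, 10)),
        'G': (F(4, 10), F(5, 10)),  'I': (F(5, 10), F(6, 10)),
        'L': (F(6, 10), F(8, 10)),  'S': (F(8, 10), F(9, 10)),
        'T': (F(9, 10), F(10, 10)),
    }

    def arithmetic_encode(message):
        low, high = F(0), F(1)
        for c in message:
            rng = high - low
            lo_c, hi_c = RANGES[c]
            high = low + rng * hi_c   # shrink the interval to the
            low = low + rng * lo_c    # sub-range owned by symbol c
        return low, high

    low, high = arithmetic_encode("BILL GATES")
    print(float(low), float(high))    # 0.2572167752 0.2572167756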

Arithmetic Coding Decoding is the inverse process. Since 0.2572167752 falls between 0.2 and 0.3, the first character must be B. We remove the effect of B from 0.2572167752 by first subtracting the low value of B (0.2), giving 0.0572167752, and then dividing by the width of B's range (0.1). This gives a value of 0.572167752, from which the next character is decoded in the same way.
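
A matching decoder sketch (same assumed symbol model as the encoder above; it peels off one symbol at a time exactly as just described):

    from fractions import Fraction as F

    # Same assumed symbol model as in the encoder sketch above
    SYMBOLS = [(' ', 1), ('A', 1), ('B', 1), ('E', 1), ('G', 1),
               ('I', 1), ('L', 2), ('S', 1), ('T', 1)]   # counts out of 10
    RANGES, cum = {}, F(0)
    for sym, count in SYMBOLS:
        RANGES[sym] = (cum, cum + F(count, 10))
        cum += F(count, 10)

    def arithmetic_decode(value, length):
        out = []
        for _ in range(length):
            for sym, (lo, hi) in RANGES.items():
                if lo <= value < hi:                   # symbol whose range contains value
                    out.append(sym)
                    value = (value - lo) / (hi - lo)   # remove that symbol's effect
                    break
        return ''.join(out)

    print(arithmetic_decode(F('0.2572167752'), 10))    # BILL GATES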

Arithmetic Coding In summary, the encoding process is simply one of narrowing the range of possible numbers with every new symbol. The new range is proportional to the predefined probability attached to that symbol. Decoding is the inverse procedure, in which the range is expanded in proportion to the probability of each symbol as it is extracted.

Arithmetic Coding The coding rate theoretically approaches the high-order entropy of the source. It is not as popular as Huffman coding because costlier arithmetic operations (multiplications on high-precision numbers) are needed.

Example Apply arithmetic coding to the following string: SWISS_MISS

Shannon's Theorem & Channel Capacity Shannon's theorem gives an upper bound on the capacity of a link, in bits per second (bps), as a function of the available bandwidth and the signal-to-noise ratio of the link. The theorem can be stated as C = B log_2(1 + S/N), where C is the achievable channel capacity, B is the bandwidth of the line, S is the average signal power and N is the average noise power.

Shannon's Theorem & Channel Capacity The signal-to-noise ratio (S/N) is usually expressed in decibels (dB), given by the formula 10 log_10(S/N); so, for example, a signal-to-noise ratio of 1000 is commonly expressed as 10 log_10(1000) = 30 dB.

Shannon's Theorem & Channel Capacity (Figure: the relationship between C/B and S/N in dB.)

Examples Here are two examples of the use of Shannon's theorem. Modem: for a typical telephone line with a signal-to-noise ratio of 30 dB and an audio bandwidth of 3 kHz, we get a maximum data rate of C = 3000 x log_2(1001) = 3000 x 9.967 ≈ 29,901 bps, which is a little less than 30 kbps (30,720).

Examples Satellite TV Channel: for a satellite TV channel with a signal-to-noise ratio of 20 dB and a video bandwidth of 10 MHz, we get a maximum data rate of C = 10,000,000 x log_2(101), which is about 66 Mbps.

Other Examples The signal-to-noise ratio is often given in decibels. Assume that SNR_dB = 36 and the channel bandwidth is 2 MHz. Then S/N = 10^(36/10) ≈ 3981, and the theoretical channel capacity is C = 2,000,000 x log_2(1 + 3981) ≈ 24 Mbps.
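
These calculations are easy to script; here is a small Python sketch that also handles the dB conversion:

    from math import log2

    def channel_capacity(bandwidth_hz, snr_db):
        # Shannon capacity C = B * log2(1 + S/N), with S/N given in dB
        snr = 10 ** (snr_db / 10)
        return bandwidth_hz * log2(1 + snr)

    print(channel_capacity(3_000, 30))        # ~29,902 bps  (telephone modem)
    print(channel_capacity(10_000_000, 20))   # ~66.6 Mbps   (satellite TV channel)
    print(channel_capacity(2_000_000, 36))    # ~23.9 Mbps   (SNR_dB = 36, B = 2 MHz)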

My Blog worldsj.wordpress.com