UNIT I INFORMATION THEORY

Claude Shannon (1916-2001), the creator of information theory, laid the foundation for implementing logic in digital circuits in his master's thesis (1937) and published the landmark paper "A Mathematical Theory of Communication" (1948). The ideas of information theory have been applied to many areas, such as wireless communication, video compression, and bioinformatics.

Objective
To introduce students to the concepts of amount of information, entropy, channel capacity, error-detection and error-correction codes, block coding, convolutional coding, and the Viterbi decoding algorithm; to calculate the capacity of a communication channel, with and without noise; to study coding schemes, including error-correcting codes; to see how discrete channels and measures of information generalize to their continuous forms; the Fourier perspective; and extensions to wavelets, complexity, compression, and efficient coding of audio-visual information.

Information theory
Information theory is a branch of mathematics that overlaps with communications engineering, biology, medical science, sociology, and psychology. The theory is devoted to the discovery and exploration of mathematical laws that govern the behavior of data as it is transferred, stored, or retrieved. Information theory deals with the measurement and transmission of information through a channel. Whenever data is transmitted, stored, or retrieved, a number of variables are involved, such as bandwidth, noise, data transfer rate, storage capacity, number of channels, propagation delay, signal-to-noise ratio, accuracy (or error rate), intelligibility, and reliability. In audio systems, additional variables include fidelity and dynamic range. In video systems, image resolution, contrast, color depth, color realism, distortion, and the number of frames per second are significant variables.

Information
Suppose the allowable messages (or symbols) are m_1, m_2, ...
and each has probability of occurrence p_1, p_2, ... The transmitter selects message k with probability p_k. (The complete set of symbols {m_1, m_2, ...} is called the alphabet.) If the receiver correctly identifies the message, then an amount of information I_k given by

    I_k = log2(1/p_k)

has been conveyed. I_k is dimensionless, but is measured in bits. The definition of information satisfies a number of useful criteria. It is intuitive: the occurrence of a highly probable event carries little information (I_k = 0 for p_k = 1). It is positive: information may not decrease upon receiving a message (I_k >= 0 for 0 <= p_k <= 1). We gain more information when a less probable message is received (I_k > I_l for p_k < p_l). Finally, information is additive if the messages are independent:
    I_{k,l} = log2(1/(p_k p_l)) = log2(1/p_k) + log2(1/p_l) = I_k + I_l

Entropy
In information theory, entropy is a measure of the uncertainty associated with a random variable. Shannon entropy quantifies the expected value of the information contained in a message, usually in units such as bits. Equivalently, the Shannon entropy is a measure of the average information content one is missing when one does not know the value of the random variable. Suppose we have M different independent messages (as before), and that a long sequence of L messages is generated. In the L-message sequence, we expect p_1 L occurrences of m_1, p_2 L occurrences of m_2, and so on. The total information in the sequence is

    I_total = p_1 L log2(1/p_1) + p_2 L log2(1/p_2) + ...

so the average information per message interval will be

    H = I_total / L = p_1 log2(1/p_1) + p_2 log2(1/p_2) + ... = sum_k p_k log2(1/p_k)

This average information is referred to as the entropy.
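The definitions above can be sketched in a few lines of Python (the function names are illustrative, not part of any standard library):

```python
import math

def self_information(p):
    """I_k = log2(1/p_k): information (in bits) of a message with probability p."""
    return math.log2(1.0 / p)

def entropy(probs):
    """H = sum_k p_k * log2(1/p_k): average information per message, in bits."""
    return sum(p * self_information(p) for p in probs if p > 0)

# A certain event (p = 1) carries no information; rarer events carry more.
print(self_information(1.0))    # 0.0
print(self_information(0.125))  # 3.0
# Four equiprobable messages give H = log2(4) = 2 bits per message.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
```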
Information rate
If the source generates messages at the rate r per second, then the information rate is defined to be R = r H, the average number of bits of information per second.

Classification of codes
A binary code encodes each character as a binary string or codeword. There are two ways to encode a file. A fixed-length code assigns every codeword the same length; ASCII, the most widely used code for representing text in computer systems, is a fixed-length code. A variable-length code allows codewords of different lengths; examples are Morse code, Shannon-Fano code, and Huffman code. A prefix code is one in which no codeword is a prefix of any other codeword.

Source Coding Theorem (Shannon's first theorem)
The theorem can be stated as follows: given a discrete memoryless source of entropy H(S), the average codeword length L for any distortionless source coding is bounded as

    L >= H(S)

This theorem provides the mathematical tool for assessing data compaction, i.e. lossless data compression, of data generated by a discrete memoryless source. The entropy of a source is a function of the probabilities of the source symbols that constitute the alphabet of the source.
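A quick numerical check of the bound L >= H(S), using hypothetical codeword lengths for a dyadic source (probabilities that are powers of 1/2), where the bound is met with equality:

```python
import math

def entropy(probs):
    """H(S) = sum_k p_k * log2(1/p_k), in bits per symbol."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def avg_length(probs, lengths):
    """Average codeword length L = sum_k p_k * l_k."""
    return sum(p * l for p, l in zip(probs, lengths))

# Dyadic source: choosing l_k = log2(1/p_k) achieves L = H(S) exactly.
p = [0.5, 0.25, 0.125, 0.125]
l = [1, 2, 3, 3]
print(avg_length(p, l))  # 1.75
print(entropy(p))        # 1.75 -- the source coding bound holds with equality
```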
Entropy of Discrete Memoryless Source
Assume that the source output is modeled as a discrete random variable, S, which takes on symbols from a fixed finite alphabet S = {s_0, s_1, ..., s_{K-1}} with probabilities

    P(S = s_k) = p_k,  k = 0, 1, ..., K-1,  where  sum_k p_k = 1

Define the amount of information gained after observing the event S = s_k as the logarithmic function

    I(s_k) = log2(1/p_k) bits

The entropy of the source is defined as the mean of I(s_k) over the source alphabet S:

    H(S) = E[I(s_k)] = sum_{k=0}^{K-1} p_k log2(1/p_k) bits

The entropy is a measure of the average information content per source symbol. The source coding theorem is also known as the "noiseless coding theorem" in the sense that it establishes the condition for error-free encoding to be possible.

Channel Coding Theorem (Shannon's second theorem)
The channel coding theorem for a discrete memoryless channel is stated in two parts as follows:
(a) Let a discrete memoryless source with an alphabet S have entropy H(S) and produce symbols once every T_s seconds. Let a discrete memoryless channel have capacity C and be used once every T_c seconds. Then if

    H(S)/T_s <= C/T_c

there exists a coding scheme for which the source output can be transmitted over the channel and be reconstructed with an arbitrarily small probability of error.
(b) Conversely, if

    H(S)/T_s > C/T_c

it is not possible to transmit information over the channel and reconstruct it with an arbitrarily small probability of error.
The theorem specifies the channel capacity C as a fundamental limit on the rate at which the transmission of reliable, error-free messages can take place over a discrete memoryless channel.

Theorem: Kraft inequality
For any binary prefix code with codeword lengths l_1, l_2, ..., l_K,

    sum_{k=1}^{K} 2^(-l_k) <= 1

Conversely, given any set of lengths satisfying this inequality, a binary prefix code with those codeword lengths exists.
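The Kraft inequality is easy to check numerically; a minimal sketch (the function name is illustrative):

```python
def kraft_sum(lengths):
    """Kraft sum for binary codeword lengths: sum_k 2^(-l_k)."""
    return sum(2.0 ** -l for l in lengths)

# The Shannon-Fano lengths from the example below (2, 2, 2, 3, 3)
# satisfy the inequality with equality, so a prefix code exists.
print(kraft_sum([2, 2, 2, 3, 3]))  # 1.0
# Lengths (1, 1, 2) violate the inequality: no binary prefix code can
# have two one-bit codewords plus anything else.
print(kraft_sum([1, 1, 2]))        # 1.25, which exceeds 1
```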
Shannon-Fano coding
A variable-length coding based on the frequency of occurrence of each character. Divide the characters into two sets with the frequency of each set as close to half as possible, and assign the sets the prefix 0 or 1. Repeatedly divide the sets until each character has a unique coding. Shannon-Fano produces a prefix code. Huffman coding is optimal for character coding (one character, one codeword) and simple to program. Arithmetic coding is better still, since it can allocate fractional bits, but it is more complicated and has been encumbered by patents.

Example: Shannon-Fano Coding
To create a code tree according to Shannon and Fano, an ordered table is required, giving the frequency of each symbol. Each part of the table is divided into two segments such that the upper and the lower segment have sums of frequencies as nearly equal as possible. This procedure is repeated until only single symbols are left.

    Symbol   Frequency   Code Length   Code   Total Length
    ------------------------------------------------------
    A        24          2             00     48
    B        12          2             01     24
    C        10          2             10     20
    D        8           3             110    24
    E        8           3             111    24
    ------------------------------------------------------
    total: 62 symbols    SF coded: 140 bit    linear (3 bit/symbol): 186 bit
The original data can thus be coded with an average length of 140/62, approximately 2.26 bit per symbol, whereas linear coding of 5 symbols would require 3 bit per symbol. However, before generating a Shannon-Fano code tree, the frequency table must be known or derived from preceding data.

Step-by-Step Construction
Step 1 splits the table into {A, B} (sum 36) and {C, D, E} (sum 26); step 2 splits {A, B} into A and B, and {C, D, E} into C and {D, E} (sum 16); step 3 splits {D, E}.

    Symbol   Frequency   Step 1        Step 2        Step 3
                         Sum   Code    Sum   Code    Sum   Code
    -----------------------------------------------------------
    A        24          36    0       24    00
    B        12                0       12    01
    C        10          26    1       10    10
    D        8                 1       16    11      8     110
    E        8                 1             11      8     111
    -----------------------------------------------------------

(The code trees corresponding to each step are shown as figures in the original document.)
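The splitting procedure described above can be sketched recursively in Python (the function name and the exact tie-breaking rule are illustrative choices, not part of any standard):

```python
def shannon_fano(freqs):
    """Assign Shannon-Fano codewords given a {symbol: frequency} table."""
    codes = {}

    def split(items, prefix):
        # items: list of (symbol, frequency), sorted by descending frequency
        if len(items) == 1:
            codes[items[0][0]] = prefix or "0"
            return
        total = sum(f for _, f in items)
        running, best_i, best_diff = 0, 1, float("inf")
        for i in range(1, len(items)):
            running += items[i - 1][1]
            diff = abs(total - 2 * running)  # |lower-segment sum - upper-segment sum|
            if diff < best_diff:
                best_diff, best_i = diff, i
        split(items[:best_i], prefix + "0")  # upper segment gets prefix 0
        split(items[best_i:], prefix + "1")  # lower segment gets prefix 1

    split(sorted(freqs.items(), key=lambda kv: -kv[1]), "")
    return codes

freqs = {"A": 24, "B": 12, "C": 10, "D": 8, "E": 8}
codes = shannon_fano(freqs)
print(codes)  # {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
print(sum(freqs[s] * len(c) for s, c in codes.items()))  # 140 bits total
```

This reproduces the code table and the 140-bit total from the worked example above.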
Huffman coding
Definition: a minimal variable-length character coding based on the frequency of each character. First, each character becomes a one-node binary tree, with the character as the only node; the character's frequency is the tree's frequency. The two trees with the least frequencies are joined as the subtrees of a new root that is assigned the sum of their frequencies. This is repeated until all characters are in one tree. One code bit represents each level, so more frequent characters are near the root and are coded with few bits, while rare characters are far from the root and are coded with many bits. The worst case for Huffman coding (equivalently, the longest Huffman codeword for a set of characters) occurs when the distribution of frequencies follows the Fibonacci numbers. Joining trees by frequency is the same as merging sequences by length in an optimal merge. Since a node with only one child is not optimal, any Huffman coding corresponds to a full binary tree. Huffman coding is one of many lossless compression algorithms, and it produces a prefix code.
Example: "abracadabra"

    Symbol   Frequency
    a        5
    b        2
    r        2
    c        1
    d        1

According to the outlined coding scheme, the symbols "d" and "c" are coupled together in a first step. The new interior node gets the frequency 2.

Step 1

    Symbol   Frequency             Symbol    Frequency
    a        5                     a         5
    b        2                     b         2
    r        2        ----->       r         2
    c        1                     node(c,d) 2
    d        1
(Code tree after the 1st step: figure in the original document.)

Step 2

    Symbol    Frequency            Symbol      Frequency
    a         5                    a           5
    b         2        ----->      b           2
    r         2                    node(r,c,d) 4
    node(c,d) 2

(Code tree after the 2nd step: figure in the original document.)

Step 3

    Symbol      Frequency          Symbol        Frequency
    a           5                  a             5
    node(r,c,d) 4      ----->      node(b,r,c,d) 6
    b           2

(Code tree after the 3rd step: figure in the original document.)
Step 4

    Symbol        Frequency        Symbol   Frequency
    node(b,r,c,d) 6    ----->      root     11
    a             5

(Code tree after the 4th step: figure in the original document.)

Code Table
If only one single node remains in the table, it forms the root of the Huffman tree. The paths from the root node to the leaf nodes define the codewords for the corresponding symbols:

    Symbol   Frequency   Code Word
    a        5           0
    b        2           10
    r        2           111
    c        1           1101
    d        1           1100

(The complete Huffman tree is shown as a figure in the original document.)
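The merging procedure above can be sketched with a binary heap (a minimal implementation; the function name is illustrative):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code for `text`; returns {symbol: codeword}."""
    # Heap entries: (frequency, tie-breaker, tree), where a tree is
    # either a symbol (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # the two trees of least frequency...
        f2, _, t2 = heapq.heappop(heap)  # ...become subtrees of a new root
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"

    walk(heap[0][2], "")
    return codes

codes = huffman_codes("abracadabra")
print(sum(len(codes[s]) for s in "abracadabra"))  # 23 bits in total
```

Because ties among equal frequencies may be broken differently, the individual codewords can differ from the table above, but the total coded length (here 23 bits, matching the worked example: 5x1 + 2x2 + 2x3 + 1x4 + 1x4) is the same for any Huffman code.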
Applications of Huffman Codes
Lossless image compression, text compression, and lossless audio compression.

Block Huffman Codes (or Extended Huffman Codes)
If the source alphabet is rather large, the largest symbol probability p_max is likely to be comparatively small. On the other hand, if the source alphabet contains only a few symbols, the chances are that p_max is quite large compared to the other probabilities. The average code length of a Huffman code is upper bounded by H(X) + p_max + 0.086. This would seem to assure that the code is never very bad, but when p_max is large the overhead can still be significant; encoding blocks of source symbols together (an extended Huffman code) reduces the overhead per symbol, driving the average length per source symbol toward H(X).
Discrete Memoryless Channel
A communication channel may be defined as the path or medium through which symbols flow to the receiver. A discrete memoryless channel (DMC) is a statistical model with an input X and an output Y, as shown in the figure. During each unit of time (the signaling interval), the channel accepts an input symbol from X and, in response, emits an output symbol from Y. The channel is said to be "discrete" when the alphabets of X and Y are both finite. It is said to be "memoryless" when the current output symbol depends only on the current input symbol and not on any of the previous inputs. Examples of discrete memoryless channels are the binary symmetric channel, the binary erasure channel, and the asymmetric channel.
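As an illustration of channel capacity for the first of these channels: a binary symmetric channel with crossover probability p has capacity C = 1 - H_b(p), where H_b is the binary entropy function. A minimal sketch (function names are illustrative):

```python
import math

def binary_entropy(p):
    """H_b(p) = -p*log2(p) - (1-p)*log2(1-p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity (bits per channel use) of a binary symmetric channel
    with crossover probability p: C = 1 - H_b(p)."""
    return 1.0 - binary_entropy(p)

print(bsc_capacity(0.0))  # 1.0 -- a noiseless channel carries one bit per use
print(bsc_capacity(0.5))  # 0.0 -- pure noise, no information gets through
```

Note that the capacity is symmetric in p: a channel that flips almost every bit (p near 1) is as useful as one that flips almost none, since the receiver can simply invert its decisions.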