Using an innovative coding algorithm for data encryption


Xiaoyu Ruan and Rajendra S. Katti

Abstract

This paper discusses the problem of using data compression for encryption. We first propose an algorithm for breaking a prefix-coded file by enumeration. Based on the algorithm, we analyze the complexity of breaking Huffman codes and Shannon-Fano-Elias codes, respectively, under the assumption that the cryptanalyst knows the code construction rule and the probability mass function of the source. It is shown that under this assumption Huffman codes are vulnerable, whereas Shannon-Fano-Elias codes need exponential time to break. We then propose an innovative construction of variable-length prefix codes that have the same security as Shannon-Fano-Elias codes but a smaller expected length.

Keywords: Cryptography, data compression, Huffman codes, prefix codes, Shannon-Fano-Elias codes, variable-length codes.

This work has been supported in part by the National Science Foundation under Grant CCR. The authors are with the Department of Electrical and Computer Engineering, North Dakota State University, Fargo, ND, USA (e-mail: xiaoyu.ruan@ndsu.edu; rajendra.katti@ndsu.edu).

1 Introduction

Huffman coding [1] is one of the best-known compression techniques and produces optimal compression for any given probability distribution. The use of Huffman codes for encryption has been considered in [2] and [3]. These works were motivated by security requirements for storing a large textual database on a CD-ROM. The text of the database needed to be compressed for memory efficiency and encrypted to prevent illegal use of copyrighted material. Using variable-length prefix codes for encryption does not lead to absolute secrecy as might be required by some military applications, but it makes cryptanalysis difficult enough that the cost of decryption exceeds any potential profit from breaking the code. Another application where compression and encryption are both needed is embedded multimedia systems [4]; such systems require both data and programs to be compressed, encrypted, and stored in main memory. Both data and programs are decrypted and decompressed after they enter the processor, which is considered to be secure. In this case, using compression as a means to achieve security can improve the performance and memory requirements of the system. Increased security can be obtained by using simpler encryption methods in addition to compression.

In [5], it was shown that if the cryptanalyst knows whether the encoder is using an arbitrary Huffman code (whenever two subtrees are combined, an arbitrary decision may be made as to which subtree has greater total weight, i.e., probability) or a right-heavy Huffman code (the subtree of greater total weight is always made the right subtree), but does not know the probability mass function (PMF) of the source, then a Huffman-coded file can be surprisingly difficult to cryptanalyze. However, this assumption of the cryptanalyst not knowing the PMF is invalid,

especially for text data, where the PMF is usually known. In this paper we assume that the cryptanalyst not only knows the construction procedure of the codes but also knows the PMF. Under this assumption, Huffman codes can be easily decoded and essentially provide no security.

Fraenkel and Klein [6] have enhanced prefix codes by adding a short sequence of random bits to some of the codewords. They then show that decoding such a code is NP-complete. However, the drawback of this method is that adding random bits increases the average length of the code. A better method would be to XOR the compressed sequence with a sequence of bits obtained by encrypting a seed, as was done in [7]. This would maintain the compression properties and improve the security of the code.

In this paper we consider using a variable-length prefix code for encryption. Although Huffman codes produce optimal compression, their encryption capabilities are not very good. Shannon-Fano-Elias codes [8] are a better candidate for encryption, but their compression capabilities are inferior to those of Huffman codes. In this paper we show that it is more difficult to cryptanalyze Shannon-Fano-Elias codes than Huffman codes. We then propose a new code that is as good for encryption as Shannon-Fano-Elias codes but is almost as good as Huffman codes for compression.

Assume that the information source puts out a sequence of symbols from the set X = {x_1, x_2, ..., x_n} with PMF {p(x_1), p(x_2), ..., p(x_n)}. This implies that p(X = x_i) = p(x_i), where i = 1, 2, ..., n. Each symbol in the sequence is then assigned a codeword such that the expected length of the code is small, and it is hard for an intruder to decode the sequence of codewords into the symbols being output by the source. The expected length is defined as

L = Σ_{i=1}^{n} p(x_i) l(x_i),   (1)

where l(x_i) denotes the length of the codeword for the symbol x_i with probability p(x_i). It is assumed that the intruder knows the PMF and the coding method.

In Huffman coding, the probabilities of the source symbols must be ordered non-increasingly before encoding. In other words, the probabilities of the source symbols must be examined at least twice before proper codewords can be assigned to each symbol: once for reordering the symbols and once for assigning the codewords. Huffman codes have an optimal expected length H ≤ L_Huffman < H + 1, where H represents the entropy of the source.

Shannon mentioned an approach using the cumulative distribution function when describing the Shannon-Fano code [9]. Elias later came up with a recursive implementation of this idea, now known as Shannon-Fano-Elias coding; however, Elias never published it, and it was first introduced in a 1963 information theory book by Abramson [10]. We now review the construction of Shannon-Fano-Elias codes. The modified cumulative distribution function is defined as

F̄(x_i) = Σ_{k=1}^{i−1} p(x_k) + (1/2) p(x_i).   (2)

F̄(x_i) represents the sum of the probabilities of symbols x_1 through x_{i−1} plus half of the probability of x_i. Notice that it is not required to order the symbols based on their probabilities; therefore different orderings of the symbols lead to different codes. Since the random variable is discrete, the cumulative distribution function consists of steps of size p(x_i). Therefore we can determine x_i if we know F̄(x_i). In general F̄(x_i) is a real number expressible only by an infinite number of bits, so F̄(x_i) is rounded off to

l_SFE(x_i) = ⌈log_2 (1/p(x_i))⌉ + 1   (3)

bits, denoted by ⌊F̄(x_i)⌋_{l_SFE(x_i)}. It can be shown that the set of lengths l_SFE(x_i) (1 ≤ i ≤ n) satisfies the Kraft inequality [11] and hence can be used to construct a uniquely decodable code. ⌊F̄(x_i)⌋_{l_SFE(x_i)} is within the step corresponding to x_i.

Thus we can use the first l_SFE(x_i) bits of F̄(x_i) to describe x_i. Shannon-Fano-Elias coding uses the binary form of ⌊F̄(x_i)⌋_{l_SFE(x_i)} as the codeword for x_i. The resulting codewords prove to be prefix-free. The expected length L_SFE computed using (1) satisfies H + 1 ≤ L_SFE < H + 2.

We remark that, compared to Huffman coding, in Shannon-Fano-Elias coding the probabilities of symbols can be arranged in any order, and there exist n! different orders. This feature might be employed to increase the ambiguity of the codes and hence make the code difficult to break. For example, the probabilities of the English alphabet occurring in literature [12] are shown in the second column of Table 3 and Table 4. The corresponding codewords for the alphabet ordered from A to Z are listed in the third column of Table 3, and those for the order Z to A are listed in the third column of Table 4. For the English alphabet there exist 26! different permutations of symbols; Table 3 and Table 4 show two out of the 26! cases. At a glance, one may wonder whether the resulting codeword set for the reverse order of symbols is always the complement of the codeword set for the original order. In fact, this is not true. For example, for a 3-symbol source {s_1, s_2, s_3}, let p(s_1) = p(s_2) = 0.25 and p(s_3) = 0.5. If the symbols are ordered as s_1, s_2, s_3, then the codewords are 001, 011, and 11 for s_1, s_2, and s_3, respectively. If the symbols are ordered as s_3, s_2, s_1, then the codewords are 01, 101, and 111 for s_3, s_2, and s_1, respectively.

The rest of the paper is organized as follows. Section 2 gives an algorithm for breaking a prefix code by enumeration and analyzes the complexity of breaking a Huffman code and a Shannon-Fano-Elias code. Although Shannon-Fano-Elias coding results in better security than Huffman coding, as we will see later, it increases the expected code length, thus making it inefficient. In Section 3 we propose an innovative coding algorithm that results in the same security as Shannon-Fano-Elias coding but with a smaller expected code length. In Section 4 we prove some properties of the proposed code. Numerical examples are shown in Section 5. Finally, Section 6 concludes the paper.
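The construction above is easy to mechanize. The following Python sketch is our own illustration (not code from the paper); it computes Shannon-Fano-Elias codewords for probabilities supplied in an arbitrary order and reproduces the two codeword sets of the 3-symbol example:

```python
import math

def sfe_code(probs):
    """Shannon-Fano-Elias codewords for probabilities given in an arbitrary order.

    For each symbol, take F-bar = (sum of earlier probabilities) + p/2, round it to
    l = ceil(log2(1/p)) + 1 bits, and use those bits as the codeword (Eqs. (2)-(3)).
    """
    codewords = []
    cum = 0.0
    for p in probs:
        fbar = cum + p / 2.0
        length = math.ceil(math.log2(1.0 / p)) + 1
        bits, frac = "", fbar
        for _ in range(length):          # first `length` bits of the binary expansion
            frac *= 2
            bit = int(frac)
            bits += str(bit)
            frac -= bit
        codewords.append(bits)
        cum += p
    return codewords

print(sfe_code([0.25, 0.25, 0.5]))   # ['001', '011', '11']  (order s1, s2, s3)
print(sfe_code([0.5, 0.25, 0.25]))   # ['01', '101', '111']  (order s3, s2, s1)
```

For dyadic probabilities, as in this example, the floating-point arithmetic above is exact; for arbitrary probabilities, exact rational arithmetic (e.g. fractions.Fraction) avoids rounding the codeword length the wrong way.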

2 Breaking a prefix code by enumeration

Suppose the cryptanalyst wants to break a prefix-coded sequence. In order to do this, his goal is to find out the codeword set used by the encoder. Once this is done, some efficient method can perhaps be utilized to match codewords and symbols. Assume that he has worked out several candidate codeword sets by some means. Furthermore, he has received a few encoded sequences. Now he needs to examine the candidate sets and find out the right one.

Let S_0 be the encoded sequence to be analyzed; S_0 is one of the encoded sequences that the cryptanalyst has received. The variable S is initialized to S_0 and is progressively updated by Algorithm 1. S is a binary sequence denoted by (b_1 b_2 ...). Algorithm 1 scans S_0 and checks whether a candidate codeword set φ = {C_1, C_2, ..., C_{n−1}, C_n} could be the one used for encoding. Without loss of generality we let length(C_1) ≥ length(C_2) ≥ ... ≥ length(C_{n−1}) ≥ length(C_n), where the function length(C_i) returns the length of codeword C_i.

Algorithm 1 first searches S for the longest codeword and then progressively searches for shorter and shorter codewords. Suppose the sequence C_i occurs t (t > 0) times in S; notice that an occurrence may overlap with others. The bits to the left of each of the t occurrences are checked. If none of them match any codeword in the candidate set, then each of the t found sequences, though identical to C_i, is not actually a codeword but either the combination of a suffix and a prefix of two codewords or an infix of a codeword; i.e., C_i is not in the codeword set used for encoding. If for some of the t occurrences the left neighboring bits match a codeword in the candidate set, then C_i is indeed in the codeword set used for encoding. The rest of the occurrences are either combinations of a suffix and a prefix of two codewords or infixes of a codeword. We then remove the occurrences of C_i which have codewords to their left. The updated S is again searched for the next-longest codeword, i.e., C_{i+1}.

Algorithm 1 Check whether a candidate codeword set may be the one used for encoding or cannot be the one used for encoding.

i ← 1
S ← S_0
γ ← ∅
π ← ∅
while i ≤ n do
    β ← ∅
    search for C_i in S from left to right; t occurrences of C_i are found
    if t = 0 then
        γ ← γ ∪ {C_i}
    else
        for 1 ≤ k ≤ t do
            p_k ← index of the leftmost bit of the k-th occurrence of C_i
        end for
        if p_1 = 1 then
            β ← β ∪ {1}
        end if
        for 1 ≤ k ≤ t do
            if p_k − 1 ≥ length(C_n) then
                for length(C_n) ≤ j ≤ min{length(C_i), p_k − 1} do
                    if (b_{p_k − j} b_{p_k − j + 1} ... b_{p_k − 2} b_{p_k − 1}) ∈ φ then
                        β ← β ∪ {k}
                        break
                    end if
                end for
            end if
        end for
        if β = ∅ then
            γ ← γ ∪ {C_i}
        else
            π ← π ∪ {C_i}
            S ← S with the k-th occurrence of C_i removed (for all k ∈ β)
        end if
    end if
    i ← i + 1
end while
if S is empty then
    return ("This codeword set may be the one used for encoding. Elements in π are in the codeword set used for encoding. Unable to decide if elements in γ are in the codeword set used for encoding or not.")
else
    return ("This codeword set is not the one used for encoding.")
end if
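Algorithm 1 exploits the structure of overlapping occurrences and left neighbors. A much simpler check in the same spirit, and not Algorithm 1 itself, is to test whether the intercepted bit string can be segmented into codewords of the candidate set at all. The Python sketch below (our own illustration) does this with a small dynamic program; the candidate sets are the two Shannon-Fano-Elias codeword sets computed for the 3-symbol example in the Introduction, and the intercepted string is a hypothetical encoding under the first of them:

```python
def parses(bits, codewords):
    """True if `bits` is a concatenation of strings from `codewords`.

    ok[i] records whether the first i bits can be segmented into codewords;
    a candidate set that cannot tile the whole sequence is ruled out.
    """
    ok = [True] + [False] * len(bits)
    for i in range(1, len(bits) + 1):
        for c in codewords:
            if i >= len(c) and ok[i - len(c)] and bits[i - len(c):i] == c:
                ok[i] = True
                break
    return ok[-1]

# Candidate sets from the 3-symbol example (orders s1 s2 s3 and s3 s2 s1).
candidates = {"s1 s2 s3": ["001", "011", "11"], "s3 s2 s1": ["01", "101", "111"]}
intercepted = "00111011"    # hypothetical ciphertext: s1 s3 s2 encoded with the first set
for key, cw in candidates.items():
    print(key, parses(intercepted, cw))   # True for the first set, False for the second
```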

The process continues until all codewords in the candidate set are checked. At the end of this process, if S is empty then the candidate set could be the set used for encoding; otherwise the candidate set is not the one used for encoding. Since the code is prefix-free, a codeword is not a prefix of any other codeword. Furthermore, a codeword must not be an infix or suffix of other codewords of equal or smaller length. This explains the principle and correctness of Algorithm 1.

To illustrate the operation of Algorithm 1, we consider an example that uses Shannon-Fano-Elias coding. Table 1 shows the source symbols and PMF. There are three symbols and thus 3! = 6 different ways in which they can be ordered. We define an order to be a key, which has been transmitted to legal receivers via private channels. Suppose one of the encoded sequences the cryptanalyst received is

Firstly the codeword set {0, 0, 0}, corresponding to key ABC, is examined. There does not exist 0 in the sequence. Searching for 0 we find one occurrence:

ˆ

The leading bit of the occurrence is hatted. The bits to the left of 0 are neither 0 nor 0, so 0 is not in the codeword set used for encoding. Finally we search for 0 and find 5 occurrences:

ˆ0ˆ0ˆ0ˆ00ˆ00ˆ0ˆ00000ˆ0000ˆ0ˆ0ˆ0ˆ00ˆ00000ˆ00000ˆ0.

We then remove the 0s that have a valid codeword to their left. Except for the leftmost 0, the left neighboring bits of the removed 0s are also 0. The resulting sequence then is

Table 1: An Example for Algorithm 1

Key   x_1   x_2   x_3
ABC   A     B     C
ACB   A     C     B
BAC   B     A     C
BCA   B     C     A
CBA   C     B     A
CAB   C     A     B

Since the process ends up with a non-empty sequence, Algorithm 1 concludes that {0, 0, 0} is not the right set. Candidate sets {000, 0, 0}, {0, 00, 0}, {00, 00, 0}, and {0000, 0, 0} can be eliminated likewise.

Next consider {0000, 00, 0}, corresponding to key CBA. The longest codeword occurs four times in the sequence:

ˆ0000ˆ ˆ00000ˆ0000.

Picking any one of the four, we find that the left neighboring bits are in {0000, 00, 0}. We then update S by removing all four occurrences of the longest codeword. Searching for 00 we find 2 occurrences:

ˆ0ˆ0ˆ0ˆ00ˆ00ˆ0ˆ00ˆ0ˆ0ˆ00ˆ0ˆ00.

We then remove the 00s with valid codewords to their left. Except for the leftmost 00, the left neighboring bits of the removed 00s are either 00 or 0. The remaining sequence is made up of consecutive codewords 0, which are then removed. Finally an empty sequence results. Therefore {0000, 00, 0} must be the codeword set used for encoding, since all of the other five candidate sets have been eliminated. The corresponding key is CBA, and the plaintext is BAAABABCCBAABCAC.

It is possible that more than one candidate set could be the one used for encoding. In this case, Algorithm 1 only reduces the size of the candidate-set space but does not specify which is the right one. Therefore the cryptanalyst may need more than one encoded sequence to make a correct judgement. He has to check each of the candidate sets that have passed

the examinations by Algorithm 1 for previous sequences. This is performed until only one candidate set is left. For example, consider the encoded sequence 000, which can be split into

For this sequence, Algorithm 1 eliminates candidate sets {0, 00, 0}, {00, 00, 0}, and {0000, 00, 0}. Each of the other three sets could be the one used for encoding because they all contain codewords 0 and 0. The size of the candidate-set space is reduced from 6 to 3. In order to find out the right one, analysis of more sequences is necessary.

The general process of breaking a prefix-coded sequence is:

1. Find all possible codeword sets (i.e., candidates).
2. For an encoded sequence S_0, check all candidates by repeatedly using Algorithm 1.
3. Record the candidates that may be the one used for encoding.
4. If more than one set results from step 3, then pick another encoded sequence S_0 and repeat steps 2 and 3 until only one candidate is left. This is the one used for encoding.

We now restrict our attention to Huffman codes and Shannon-Fano-Elias codes. For Huffman coding, if the cryptanalyst knows the construction rule and the PMF, then there is only one candidate codeword set; there is no ambiguity at all. Thus Huffman coding is easy to break and is not applicable when encryption is required along with compression. For Shannon-Fano-Elias coding, even if the cryptanalyst knows the construction rule and the PMF, there are still n! different possible keys. These keys may result in at most

n! candidate codeword sets. Notice that different keys may sometimes result in the same codeword set. n! is an exponential quantity according to Stirling's formula. Therefore it is practically impossible to break a Shannon-Fano-Elias encoded sequence by using only enumeration. Of course, the cryptanalyst may employ some cryptographic tricks to reduce the search space, but this can be taken care of by changing the PMF slightly while doing the encoding, making the code virtually impossible to break. This is because there are infinitely many possible PMFs and each PMF can have as many as n! possible codeword sets. However, we do not address this problem here since it is beyond the scope of this paper.

3 New coding

As mentioned in the Introduction, the expected length of Shannon-Fano-Elias codes satisfies H + 1 ≤ L_SFE < H + 2. We see that L_SFE is always larger than L_Huffman, which has a range of H ≤ L_Huffman < H + 1. The motivation of this research is to develop a new coding algorithm that has a smaller expected length than Shannon-Fano-Elias coding while the security level remains the same, i.e., breaking the code still requires exponential time. In this section we propose such a code. In the next section we show that the expected length of our new code satisfies H ≤ L_ours < H + 2.

Take the source symbol set X = {x_1, x_2, ..., x_n} and let the probability of symbol x_i be p(x_i). The encoding is based on a binary tree, with left branches denoting 0 and right branches denoting 1. The root is at level 0, and the levels of the binary tree indicate the lengths of codewords. The function avail(j) denotes the number of leaves available to be chosen as a codeword at level j of the tree. Before encoding starts, avail(j) is initialized to 2^j for j ≥ 1. Algorithm 2 gives the process of assigning a codeword to symbol x_i (1 ≤ i ≤ n), based on the sum of the i − 1 probabilities p(x_1) through p(x_{i−1}) of symbols x_1 through x_{i−1} and the probability of symbol x_i.

Algorithm 2 Assign a codeword for symbol x_i.

Input: p(x_1), p(x_2), ..., p(x_i) such that p(x_i) > 0 and Σ_{k=1}^{i} p(x_k) ≤ 1.
Output: codeword(x_i).

#ofCodewordsRequired ← ⌈(1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i)⌉
length(x_i) ← min{j : avail(j) ≥ #ofCodewordsRequired}
codeword(x_i) ← leftmost available leaf on level length(x_i)
if avail(s) mod 2 = 0 for all s with 1 ≤ s ≤ length(x_i) then
    b ← 1
else
    b ← min{j : avail(j) mod 2 = 1 and avail(s) mod 2 = 0 for all s with j + 1 ≤ s ≤ length(x_i)}
end if
for length(x_i) ≥ m ≥ b do
    avail(m) ← avail(m) − 1
end for
for m > length(x_i) do
    avail(m) ← avail(m) − 2^{m − length(x_i)}
end for

Algorithm 2 consists of two steps: the first step (the first three lines) assigns a codeword to x_i; the second step updates the function avail(j). In the first step, when assigning a codeword to x_i (1 ≤ i < n), it is guaranteed that length(x_i') ≤ length(x_i) for any later symbol x_i' (i < i' ≤ n) with probability p(x_i') ≥ p(x_i). This strategy always reserves just enough leaves (codewords) for potential symbols (symbols that have not been encountered yet) with high probabilities, so the expected length is reduced as much as possible. To achieve this, we first evaluate the required number of codewords at the level to which x_i will belong (i.e., level length(x_i)). This number is

#ofCodewordsRequired = ⌈(1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i)⌉,   (4)

which accounts for one node for x_i itself, for at most ⌊(1 − Σ_{k=1}^{i} p(x_k)) / p(x_i)⌋ symbols with probability at least p(x_i) that may be encountered in the future (we guarantee that they will be assigned codewords not longer than the one assigned to x_i), and, when such symbols do not exhaust the remaining probability, for one node at level length(x_i) that is reserved

14 for x i. The second + means one node at level length(x i ) must be reserved for the future symbols with probabilities less than. The decedents of this node will be codewords for those symbols. Algorithm 2 then checks the function avail(j) and finds the smallest j that satisfies avail(j) #of CodewordsRequired. Any leaf on level j can be the codeword for x i. Without loss of generality Algorithm 2 chooses the leftmost one. In the second step, Algorithm 2 updates function avail(j) according to the following rule: the nodes that are parents of the leaf that is used for coding x i are no longer available, and the children of the leaf that is used for coding x i are no longer available either. The whole process for encoding a source is given in Figure. Table 2 shows a simple example. The source contains 5 symbols. The probability p(x ) of the first symbol is 0.20, thus #ofcodewordsrequired = 0.20 = 5. Since initially avail(2) = 4 < 5 and avail(3) = 8 5, we will assign a 3-bit codeword to x. According to the algorithm the leftmost available node at level-3 of the tree, i.e., 000, is chosen to be the codeword for x. Since 000 has been used, nodes that are prefix of 000 (i.e., 0 and 00) and have 000 as prefix (such as 0000, 000, 000 etc.) cannot be used for codewords anymore. Therefore the updated function avail(j) should be: avail() = ( can be a codeword), avail(2) = 3 ( 0, 0, and can be codewords), avail(3) = 7 (all 3-bit sequences expect 000 can be codewords), avail(4) = 4 (all 4-bit sequences expect 0000 and 000 can be codewords) etc. We continue to assign codewords for x 2 through x 5 and update avail(j) accordingly. The final codeword assignment is illustrated in Figure 2. Similar to Shannon-Fano-Elias coding, for the proposed coding technique the probabilities of symbols can be input in any order. Therefore there exist n! different permutations of symbols making the cryptanalysis by enumeration impractical. Another advantage of the proposed method is that, since no reordering is required, the probabilities are scanned only once. This makes the computation faster than using Huffman coding. Also notice that the 4

Figure 1. Encoding Process: initialize avail(j) = 2^j for j > 0 and set i = 1; feed p(x_1), p(x_2), ..., p(x_i) to Algorithm 2, which outputs the codeword for x_i; if p(x_1) + p(x_2) + ... + p(x_i) < 1, set i = i + 1 and repeat, otherwise end.

Table 2: An Example of the Construction of the New Code

i   Length(x_i)   Codeword   avail(1)  avail(2)  avail(3)  avail(4)
-   -             -          2         4         8         16
1   3             000        1         3         7         14
2   2             01         1         2         5         10
3   3             001        1         2         4         8
4   2             10         0         1         2         4
5   2             11         0         0         0         0

(The first row of avail(j) values gives the initial state; each subsequent row gives avail(j) after the codeword for x_i has been assigned.)

Figure 2. Example for the Encoding Process (the code tree for the source of Table 2: x_2, x_4, and x_5 are leaves at level 2; x_1 and x_3 are leaves at level 3).
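A compact way to see the bookkeeping of Algorithm 2 in action is to track the free portions of the code tree directly. The following Python sketch is our own reading of Algorithm 2, not code from the paper: it represents the available part of the tree as a list of free-subtree roots, from which avail(j) and the leftmost available leaf follow immediately. The probability list is hypothetical except for p(x_1) = 0.20, which is taken from the example above; with these numbers the sketch reproduces the codeword assignment 000, 01, 001, 10, 11 of Table 2.

```python
import math

def new_code(probs):
    """Best-effort sketch of the paper's Algorithm 2.

    The free part of the code tree is kept as a list of free-subtree roots
    (bit strings); avail(j) is the number of free leaves at level j.
    """
    free = [""]                # initially the whole tree below the root is free
    codewords = []
    cum = 0.0                  # p(x_1) + ... + p(x_{i-1})
    for p in probs:
        required = math.ceil((1.0 - cum) / p)          # #ofCodewordsRequired, Eq. (4)
        j = 1                                           # smallest level with avail(j) >= required
        while sum(2 ** (j - len(r)) for r in free if len(r) <= j) < required:
            j += 1
        root = min(r for r in free if len(r) <= j)      # leftmost free subtree reaching level j
        cw = root + "0" * (j - len(root))               # leftmost available leaf on level j
        codewords.append(cw)
        # Update the free regions: the chosen leaf, its ancestors inside the free
        # subtree, and its descendants become unavailable; the sibling subtrees
        # along the path from `root` to the leaf remain free.
        free.remove(root)
        free.extend(cw[:m] + "1" for m in range(len(root), j))
        cum += p
    return codewords

# p(x_1) = 0.20 as in the paper's example; the remaining probabilities are hypothetical.
print(new_code([0.20, 0.30, 0.15, 0.20, 0.15]))   # ['000', '01', '001', '10', '11']
```

Exact rational arithmetic (fractions.Fraction) is preferable to floats when the ceiling in (4) could land exactly on an integer.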

Also notice that only the probabilities of the symbols that have occurred up to and including the current symbol are needed to find the codeword for the current symbol; the probabilities of the symbols that have not yet occurred are not needed. This property could be very useful if the symbol set is large.

4 Properties

In this section we first show some important properties of our new code. We then prove that the expected length satisfies the following inequality:

H ≤ L_ours < H + 2.   (5)

Theorem 1. The proposed codes are prefix codes.

Proof. From the construction rule we see that only leaves of the binary tree are used as codewords; internal nodes are never used, and once a leaf is chosen, its ancestors and descendants are removed from the available set. Hence no codeword can be a prefix of another.

Theorem 2. Breaking a file encoded with the proposed codes by enumeration requires exponential time.

Proof. The n source symbols can be input to the proposed encoding algorithm in any order. Therefore the encoding process can result in as many as n! different codeword sets. Since n! is an exponential quantity, examining all of them by exhaustive search requires exponential time.
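Theorem 1 can be spot-checked mechanically on any output of the new_code sketch above; the small helper below (our own illustration, not from the paper) verifies the prefix condition directly:

```python
def is_prefix_free(codewords):
    """True if no codeword is a proper prefix of another codeword."""
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

# Codewords produced for the hypothetical 5-symbol source used earlier.
print(is_prefix_free(["000", "01", "001", "10", "11"]))   # True
```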

Let length(x_i) denote the length of the codeword for symbol x_i (1 ≤ i ≤ n). Suppose we are now assigning a codeword to symbol x_i. For the previous i − 1 symbols we have length(x_k) ∈ {j_1, j_2, ..., j_u} for 1 ≤ k ≤ i − 1. Without loss of generality let 0 < j_1 < j_2 < ... < j_u. Suppose c_k out of the i − 1 symbols have a length of j_k, where 1 ≤ k ≤ u. Obviously i − 1 = c_1 + c_2 + ... + c_u.

Let x and y be real numbers. The following two propositions are used in the proof of Lemma 3.

1. If x < ⌈y⌉ then x < y + 1.
2. If ⌈x⌉ ≤ y then x ≤ y.

Lemma 3. For 1 ≤ i ≤ n,

⌈log_2 ((1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i))⌉ − u ≤ length(x_i) ≤ ⌈log_2 ((1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i) + u)⌉.   (6)

Proof. By the code construction rule, before a codeword is assigned to x_i, length(x_i) satisfies

avail(length(x_i) − 1) < #ofCodewordsRequired ≤ avail(length(x_i)),   (7)

where #ofCodewordsRequired = ⌈(1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i)⌉. We also have

avail(length(x_i) − 1) ≤ avail(length(x_i)) / 2.   (8)

To find avail(length(x_i)) and avail(length(x_i) − 1), three cases are discussed. In Case 1, the length of the codeword for x_i is smaller than the shortest codeword among x_1 through x_{i−1}. In Case 2, the length of the codeword for x_i is at least the shortest and smaller than the longest codeword among x_1 through x_{i−1}. In Case 3, the length of the codeword for x_i is at least the longest codeword among x_1 through x_{i−1}.

Case 1: length(x_i) < j_1. We need to compute the number of available codewords at level length(x_i) before assigning a codeword to x_i. In order to do this, we first find the number of nodes at level length(x_i) that cannot be used as codewords because they are either codewords themselves or ancestors of nodes representing codewords at levels lower

than length(x_i). The lowest level of the tree is j_u. The c_u codewords at level j_u are equivalent to ⌈c_u / 2^{j_u − j_{u−1}}⌉ nodes at level j_{u−1} not being available. This number is added to c_{u−1}, the number of codewords at level j_{u−1}. Likewise, the c_u codewords at level j_u and the c_{u−1} codewords at level j_{u−1} are together equivalent to ⌈(⌈c_u / 2^{j_u − j_{u−1}}⌉ + c_{u−1}) / 2^{j_{u−1} − j_{u−2}}⌉ nodes at level j_{u−2} not being available. We continue computing like this. Finally we find that the i − 1 codewords for symbols x_1 through x_{i−1} are equivalent to

⌈(⋯ ⌈(⌈c_u / 2^{j_u − j_{u−1}}⌉ + c_{u−1}) / 2^{j_{u−1} − j_{u−2}}⌉ + ⋯ + c_1) / 2^{j_1 − length(x_i)}⌉   (9)

nodes at level length(x_i) not being available. Therefore

avail(length(x_i)) = 2^{length(x_i)} − ⌈(⋯ ⌈(⌈c_u / 2^{j_u − j_{u−1}}⌉ + c_{u−1}) / 2^{j_{u−1} − j_{u−2}}⌉ + ⋯ + c_1) / 2^{j_1 − length(x_i)}⌉.   (10)

Solving (10), (8) and (7) we get (6).

Case 2: j_r ≤ length(x_i) < j_{r+1}, where 1 ≤ r ≤ u − 1. If j_k ≤ length(x_i) for 1 ≤ k ≤ r, then the c_k codewords at level j_k are equivalent to c_k 2^{length(x_i) − j_k} nodes at level length(x_i) not being

available. Before assigning a codeword to x_i, we have

avail(length(x_i)) = 2^{length(x_i)} − Σ_{k=1}^{r} c_k 2^{length(x_i) − j_k} − ⌈(⋯ ⌈(⌈c_u / 2^{j_u − j_{u−1}}⌉ + c_{u−1}) / 2^{j_{u−1} − j_{u−2}}⌉ + ⋯ + c_{r+1}) / 2^{j_{r+1} − length(x_i)}⌉.   (12)

Solving (12), (8) and (7) we get (6).

Case 3: length(x_i) ≥ j_u. Before assigning a codeword to x_i, we have

avail(length(x_i)) = 2^{length(x_i)} − Σ_{k=1}^{u} c_k 2^{length(x_i) − j_k}.   (13)

Solving (13), (8) and (7) we get (6).

Lemma 4. For 1 ≤ i ≤ n,

Σ_{k=1}^{i} p(x_k) ≥ Σ_{k=1}^{u} c_k 2^{−j_k},   (14)

where j_1 < ... < j_u are here the distinct codeword lengths among x_1, ..., x_i and c_k of these i symbols have length j_k, so that the right-hand side equals Σ_{k=1}^{i} 2^{−length(x_k)}.

Proof. We use induction. For the first symbol x_1 we have 2^{length(x_1) − 1} < ⌈1/p(x_1)⌉ ≤ 2^{length(x_1)}, and hence

length(x_1) = ⌈log_2 (1/p(x_1))⌉.   (15)

This implies that

p(x_1) ≥ 2^{−length(x_1)}.   (16)

Assume that for some integer I with 1 ≤ I < n we have

Σ_{k=1}^{I} p(x_k) ≥ Σ_{k=1}^{I} 2^{−length(x_k)}.   (17)

What will be shown is that the same inequality holds for the first I + 1 symbols. Notice that Σ_{k=1}^{I+1} p(x_k) = Σ_{k=1}^{I} p(x_k) + p(x_{I+1}), so it suffices to prove that

Σ_{k=1}^{I} p(x_k) + p(x_{I+1}) ≥ Σ_{k=1}^{I} 2^{−length(x_k)} + 2^{−length(x_{I+1})}.   (18)

By (17) it remains to show that p(x_{I+1}), together with the slack of (17), covers 2^{−length(x_{I+1})}, i.e., that 2^{−length(x_{I+1})} ≤ p(x_{I+1}) + (Σ_{k=1}^{I} p(x_k) − Σ_{k=1}^{I} 2^{−length(x_k)}) ((19)–(20)). This follows from the lower bound on length(x_{I+1}) given by Lemma 3 ((21)–(22)), which implies (18). By induction we have proved (14).

Theorem 5. For 1 ≤ i ≤ n,

length(x_i) ≤ ⌈log_2 (1/p(x_i))⌉ + 1.   (23)

Proof. It is obvious that p(x_i) ≤ 1 − Σ_{k=1}^{i−1} p(x_k). Combining this with Lemma 4, the argument of the ceiling in the upper bound of Lemma 3 satisfies

(1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i) + u ≤ 2 / p(x_i).   (24)

Therefore

⌈log_2 ((1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i) + u)⌉ ≤ ⌈log_2 (2/p(x_i))⌉ = ⌈log_2 (1/p(x_i))⌉ + 1.   (25)–(26)

By Lemma 3 the left-hand side is an upper bound on length(x_i). This proves (23).

Theorem 6. For the same source,

L_ours < L_SFE.   (27)

Proof. The length of the codeword for x_1 is given by (15). For 2 ≤ i ≤ n, by Theorem 5,

length(x_i) ≤ ⌈log_2 (1/p(x_i))⌉ + 1.   (28)

Recall that for Shannon-Fano-Elias codes the codeword length is given by (3). Therefore

L_ours = Σ_{k=1}^{n} p(x_k) length(x_k) < Σ_{k=1}^{n} p(x_k) (⌈log_2 (1/p(x_k))⌉ + 1) = L_SFE.   (29)

Theorem 7. For a source with entropy H,

H ≤ L_ours < H + 2.   (30)

Proof. By the definition of the source entropy, H = Σ_{k=1}^{n} p(x_k) log_2 (1/p(x_k)), it is straightforward from (29) that L_ours < H + 2. According to Shannon's basic theorems introduced in [9], any prefix code with expected length L must satisfy L ≥ H.

However, not all prefix codes can have an expected length equal to H. For example, the lower bound for Shannon-Fano-Elias codes is H + 1 instead of H. We now show that, for our new code, if p(x_i) = 2^{−t_i} for 1 ≤ i ≤ n, where all t_i are integers such that 0 < t_1 ≤ t_2 ≤ ... ≤ t_n, then L_ours = H. The conditions p(x_i) = 2^{−t_i} and 0 < t_1 ≤ t_2 ≤ ... ≤ t_n imply that

⌈(1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i)⌉ = (1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i).   (31)

According to the construction rule, if p(x_i') ≤ p(x_i) for i < i' ≤ n then length(x_i') ≥ length(x_i). Therefore this case belongs to Case 3 of Lemma 3. Then

2^{length(x_i) − 1} − Σ_{k=1}^{u} c_k 2^{length(x_i) − 1 − j_k} < (1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i) ≤ 2^{length(x_i)} − Σ_{k=1}^{u} c_k 2^{length(x_i) − j_k},   (32)

and thus

length(x_i) = ⌈log_2 [ (1 − Σ_{k=1}^{i−1} p(x_k)) / ( p(x_i) (1 − Σ_{k=1}^{u} c_k 2^{−j_k}) ) ]⌉.   (33)

We use induction. For i = 1, according to (15) we have length(x_1) = ⌈log_2 (1/p(x_1))⌉ = t_1. Assume length(x_i) = ⌈log_2 (1/p(x_i))⌉ = t_i for 1 ≤ i ≤ I < n. Then p(x_i) = 2^{−length(x_i)} for 1 ≤ i ≤ I, and

Σ_{k=1}^{I} p(x_k) = Σ_{k=1}^{u} c_k 2^{−j_k},   (34)

where the right-hand side is taken over the first I symbols. By (33) and (34),

length(x_{I+1}) = ⌈log_2 [ (1 − Σ_{k=1}^{I} p(x_k)) / ( p(x_{I+1}) (1 − Σ_{k=1}^{I} p(x_k)) ) ]⌉ = ⌈log_2 (1/p(x_{I+1}))⌉ = t_{I+1}.   (35)

By induction, length(x_i) = ⌈log_2 (1/p(x_i))⌉ = t_i for 1 ≤ i ≤ n. Hence in this case L_ours = H. Therefore the bounds on the expected length of the proposed code are as given by (30).
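The bounds in this section are easy to check numerically. The snippet below (our own illustration) computes the entropy and expected lengths for the hypothetical 5-symbol source used with the new_code sketch earlier; the Shannon-Fano-Elias lengths come from Eq. (3), and the new code's lengths from the codewords 000, 01, 001, 10, 11 obtained above. For comparison, an optimal Huffman code for this PMF has expected length 2.30 bits.

```python
import math

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs)

def expected_length(probs, lengths):
    return sum(p * l for p, l in zip(probs, lengths))

probs = [0.20, 0.30, 0.15, 0.20, 0.15]                           # hypothetical PMF
lengths_new = [3, 2, 3, 2, 2]                                    # from the new_code sketch
lengths_sfe = [math.ceil(math.log2(1 / p)) + 1 for p in probs]   # Eq. (3)

print(round(entropy(probs), 3))                         # 2.271
print(round(expected_length(probs, lengths_new), 3))    # 2.35  (H <= L_ours < H + 2)
print(round(expected_length(probs, lengths_sfe), 3))    # 3.7   (H + 1 <= L_SFE < H + 2)
```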

Table 3: Shannon-Fano-Elias Code and Our New Code for the English Alphabet, A to Z (columns: x_i, SFE codeword, new codeword, for A through M and for N through Z)

5 Examples

We first consider codes for the English alphabet with order A to Z. The codeword assignments for Shannon-Fano-Elias coding and the proposed coding are shown in Table 3. In this example, the source entropy is , the expected length of the Huffman code is , the expected length of the Shannon-Fano-Elias code is , and the expected length of our proposed code is .

We now consider codes for the English alphabet with order Z to A. The codeword assignments for Shannon-Fano-Elias coding and the proposed coding are shown in Table 4. In this case the expected length of the proposed code is .

We see that the expected length of the proposed codes is much smaller than that of Shannon-Fano-Elias codes, and is quite close to that of Huffman codes.

Table 4: Shannon-Fano-Elias Code and Our New Code for the English Alphabet, Z to A (columns: x_i, SFE codeword, new codeword, for Z through N and for M through A)

Given that the new codes provide much better security than Huffman codes, the cost of the increase in code length is reasonable.

6 Conclusion

We have developed an innovative construction of variable-length prefix codes. To construct the proposed codes, the probabilities of the source symbols do not need to be placed in non-increasing order, unlike Huffman coding. We have shown that breaking a file encoded using the proposed code requires exponential time; therefore compression and encryption can be combined. This is not the case for the Huffman code, which proves to be vulnerable if the cryptanalyst knows the construction rule and the PMF. The bounds on the expected code length of the new code are derived, and it is shown that the proposed codes always have a shorter expected length than Shannon-Fano-Elias codes.

We remark that using compression for data encryption might not be secure enough for applications that require a high level of secrecy. However, for some common applications, such as protecting copyrighted publications from illegal access, it is convenient to combine compression and encryption in order to make cryptanalysis as expensive as the protected materials themselves. In such applications the new coding technique is very useful.

References

[1] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proc. IRE, vol. 40, pp. 1098-1101, Sept. 1952.

[2] F. Rubin, "Cryptographic aspects of data compression codes," Cryptologia, vol. 3, no. 4, 1979.

[3] S. T. Klein, A. Bookstein, and S. Deerwester, "Storing text retrieval systems on CD-ROM: compression and encryption considerations," ACM Trans. Inform. Systems, vol. 7, no. 3, Jul. 1989.

[4] H. Lekatsas, J. Henkel, S. Chakradhar, and V. Jakkula, "Cypress: compression and encryption of data and code for embedded multimedia systems," IEEE Design and Test of Computers, vol. 21, no. 5, 2004.

[5] D. W. Gillman, M. Mohtashemi, and R. L. Rivest, "On breaking a Huffman code," IEEE Trans. Inform. Theory, vol. 42, no. 3, May 1996.

[6] A. S. Fraenkel and S. T. Klein, "Complexity aspects of guessing prefix codes," Algorithmica, vol. 12, pp. 409-419, 1994.

[7] J. Yang, L. Gao, and Y. Zhang, "Improving memory encryption performance in secure processors," IEEE Trans. Computers, vol. 54, no. 5, May 2005.

[8] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley & Sons, 1991.

[9] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379-423, 623-656, Jul.-Oct. 1948.

[10] N. Abramson, Information Theory and Coding. New York: McGraw-Hill, 1963.

[11] B. McMillan, "Two inequalities implied by unique decipherability," IRE Trans. Inform. Theory, vol. IT-2, pp. 115-116, Dec. 1956.

[12] K. Laković and J. Villasenor, "On design of error-correcting reversible variable length codes," IEEE Commun. Letters, vol. 6, no. 8, pp. 337-339, Aug. 2002.


More information

3F1: Signals and Systems INFORMATION THEORY Examples Paper Solutions

3F1: Signals and Systems INFORMATION THEORY Examples Paper Solutions Engineering Tripos Part IIA THIRD YEAR 3F: Signals and Systems INFORMATION THEORY Examples Paper Solutions. Let the joint probability mass function of two binary random variables X and Y be given in the

More information

17.1 Binary Codes Normal numbers we use are in base 10, which are called decimal numbers. Each digit can be 10 possible numbers: 0, 1, 2, 9.

17.1 Binary Codes Normal numbers we use are in base 10, which are called decimal numbers. Each digit can be 10 possible numbers: 0, 1, 2, 9. ( c ) E p s t e i n, C a r t e r, B o l l i n g e r, A u r i s p a C h a p t e r 17: I n f o r m a t i o n S c i e n c e P a g e 1 CHAPTER 17: Information Science 17.1 Binary Codes Normal numbers we use

More information

PERFECTLY secure key agreement has been studied recently

PERFECTLY secure key agreement has been studied recently IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 45, NO. 2, MARCH 1999 499 Unconditionally Secure Key Agreement the Intrinsic Conditional Information Ueli M. Maurer, Senior Member, IEEE, Stefan Wolf Abstract

More information

Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory

Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory 1 The intuitive meaning of entropy Modern information theory was born in Shannon s 1948 paper A Mathematical Theory of

More information

On the redundancy of optimum fixed-to-variable length codes

On the redundancy of optimum fixed-to-variable length codes On the redundancy of optimum fixed-to-variable length codes Peter R. Stubley' Bell-Northern Reserch Abstract There has been much interest in recent years in bounds on the redundancy of Huffman codes, given

More information

Sources: The Data Compression Book, 2 nd Ed., Mark Nelson and Jean-Loup Gailly.

Sources: The Data Compression Book, 2 nd Ed., Mark Nelson and Jean-Loup Gailly. Lossless ompression Multimedia Systems (Module 2 Lesson 2) Summary: daptive oding daptive Huffman oding Sibling Property Update lgorithm rithmetic oding oding and ecoding Issues: OF problem, Zero frequency

More information

Quantum-inspired Huffman Coding

Quantum-inspired Huffman Coding Quantum-inspired Huffman Coding A. S. Tolba, M. Z. Rashad, and M. A. El-Dosuky Dept. of Computer Science, Faculty of Computers and Information Sciences, Mansoura University, Mansoura, Egypt. tolba_954@yahoo.com,

More information

COS597D: Information Theory in Computer Science October 19, Lecture 10

COS597D: Information Theory in Computer Science October 19, Lecture 10 COS597D: Information Theory in Computer Science October 9, 20 Lecture 0 Lecturer: Mark Braverman Scribe: Andrej Risteski Kolmogorov Complexity In the previous lectures, we became acquainted with the concept

More information

COMMUNICATION SCIENCES AND ENGINEERING

COMMUNICATION SCIENCES AND ENGINEERING COMMUNICATION SCIENCES AND ENGINEERING X. PROCESSING AND TRANSMISSION OF INFORMATION Academic and Research Staff Prof. Peter Elias Prof. Robert G. Gallager Vincent Chan Samuel J. Dolinar, Jr. Richard

More information

ECE 587 / STA 563: Lecture 5 Lossless Compression

ECE 587 / STA 563: Lecture 5 Lossless Compression ECE 587 / STA 563: Lecture 5 Lossless Compression Information Theory Duke University, Fall 2017 Author: Galen Reeves Last Modified: October 18, 2017 Outline of lecture: 5.1 Introduction to Lossless Source

More information

A Mathematical Theory of Communication

A Mathematical Theory of Communication A Mathematical Theory of Communication Ben Eggers Abstract This paper defines information-theoretic entropy and proves some elementary results about it. Notably, we prove that given a few basic assumptions

More information

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Information Theory and Distribution Modeling

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Information Theory and Distribution Modeling TTIC 31230, Fundamentals of Deep Learning David McAllester, April 2017 Information Theory and Distribution Modeling Why do we model distributions and conditional distributions using the following objective

More information

Introduction to information theory and coding

Introduction to information theory and coding Introduction to information theory and coding Louis WEHENKEL Set of slides No 5 State of the art in data compression Stochastic processes and models for information sources First Shannon theorem : data

More information

The Hackenbush Number System for Compression of Numerical Data

The Hackenbush Number System for Compression of Numerical Data INFORMATION AND CONTROL 26, 134--140 (1974) The Hackenbush Number System for Compression of Numerical Data E. R. BERLEKAMP* Electronics Research Laboratory, College of Engineering, University of California,

More information

CS6304 / Analog and Digital Communication UNIT IV - SOURCE AND ERROR CONTROL CODING PART A 1. What is the use of error control coding? The main use of error control coding is to reduce the overall probability

More information

1 Ex. 1 Verify that the function H(p 1,..., p n ) = k p k log 2 p k satisfies all 8 axioms on H.

1 Ex. 1 Verify that the function H(p 1,..., p n ) = k p k log 2 p k satisfies all 8 axioms on H. Problem sheet Ex. Verify that the function H(p,..., p n ) = k p k log p k satisfies all 8 axioms on H. Ex. (Not to be handed in). looking at the notes). List as many of the 8 axioms as you can, (without

More information

Information Theory, Statistics, and Decision Trees

Information Theory, Statistics, and Decision Trees Information Theory, Statistics, and Decision Trees Léon Bottou COS 424 4/6/2010 Summary 1. Basic information theory. 2. Decision trees. 3. Information theory and statistics. Léon Bottou 2/31 COS 424 4/6/2010

More information

Entropy Coding. Connectivity coding. Entropy coding. Definitions. Lossles coder. Input: a set of symbols Output: bitstream. Idea

Entropy Coding. Connectivity coding. Entropy coding. Definitions. Lossles coder. Input: a set of symbols Output: bitstream. Idea Connectivity coding Entropy Coding dd 7, dd 6, dd 7, dd 5,... TG output... CRRRLSLECRRE Entropy coder output Connectivity data Edgebreaker output Digital Geometry Processing - Spring 8, Technion Digital

More information

( c ) E p s t e i n, C a r t e r a n d B o l l i n g e r C h a p t e r 1 7 : I n f o r m a t i o n S c i e n c e P a g e 1

( c ) E p s t e i n, C a r t e r a n d B o l l i n g e r C h a p t e r 1 7 : I n f o r m a t i o n S c i e n c e P a g e 1 ( c ) E p s t e i n, C a r t e r a n d B o l l i n g e r 2 0 1 6 C h a p t e r 1 7 : I n f o r m a t i o n S c i e n c e P a g e 1 CHAPTER 17: Information Science In this chapter, we learn how data can

More information

2018/5/3. YU Xiangyu

2018/5/3. YU Xiangyu 2018/5/3 YU Xiangyu yuxy@scut.edu.cn Entropy Huffman Code Entropy of Discrete Source Definition of entropy: If an information source X can generate n different messages x 1, x 2,, x i,, x n, then the

More information

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy Haykin_ch05_pp3.fm Page 207 Monday, November 26, 202 2:44 PM CHAPTER 5 Information Theory 5. Introduction As mentioned in Chapter and reiterated along the way, the purpose of a communication system is

More information

Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression

Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression Kirkpatrick (984) Analogy from thermodynamics. The best crystals are found by annealing. First heat up the material to let

More information

Transducers for bidirectional decoding of prefix codes

Transducers for bidirectional decoding of prefix codes Transducers for bidirectional decoding of prefix codes Laura Giambruno a,1, Sabrina Mantaci a,1 a Dipartimento di Matematica ed Applicazioni - Università di Palermo - Italy Abstract We construct a transducer

More information

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1 Kraft s inequality An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if N 2 l i 1 Proof: Suppose that we have a tree code. Let l max = max{l 1,...,

More information

ECE 587 / STA 563: Lecture 5 Lossless Compression

ECE 587 / STA 563: Lecture 5 Lossless Compression ECE 587 / STA 563: Lecture 5 Lossless Compression Information Theory Duke University, Fall 28 Author: Galen Reeves Last Modified: September 27, 28 Outline of lecture: 5. Introduction to Lossless Source

More information