Using an innovative coding algorithm for data encryption
Xiaoyu Ruan and Rajendra S. Katti

Abstract. This paper discusses the problem of using data compression for encryption. We first propose an algorithm for breaking a prefix-coded file by enumeration. Based on this algorithm, we analyze the complexity of breaking Huffman codes and Shannon-Fano-Elias codes, respectively, under the assumption that the cryptanalyst knows the code construction rule and the probability mass function of the source. It is shown that under this assumption Huffman codes are vulnerable, whereas Shannon-Fano-Elias codes require exponential time to break. We then propose an innovative construction of variable-length prefix codes that has the same security as Shannon-Fano-Elias codes but a smaller expected length.

Keywords. Cryptography, data compression, Huffman codes, prefix codes, Shannon-Fano-Elias codes, variable-length codes.

This work has been supported in part by the National Science Foundation under Grant CCR. The authors are with the Department of Electrical and Computer Engineering, North Dakota State University, Fargo, ND, USA (xiaoyu.ruan@ndsu.edu; rajendra.katti@ndsu.edu).
1 Introduction

Huffman coding [1] is one of the best-known compression techniques and produces optimal compression for any given probability distribution. The use of Huffman codes for encryption has been considered in [2] and [3]. These works were motivated by security requirements for storing a large textual database on a CD-ROM. The text of the database needed to be compressed for memory efficiency and encrypted to prevent illegal use of copyrighted material. Using variable-length prefix codes for encryption does not provide the absolute secrecy that some military applications might require, but it can make cryptanalysis difficult enough that the cost of decryption exceeds any potential profit from breaking the code. Another application where compression and encryption are both needed is embedded multimedia systems [4]. Such systems require both data and programs to be compressed, encrypted, and stored in main memory. Both data and programs are decrypted and decompressed after they enter the processor, which is considered secure. In this case, using compression as a means of achieving security can improve the performance and memory requirements of the system. Increased security can be obtained by using simpler encryption methods in addition to compression. In [5], it was shown that if the cryptanalyst knows whether the encoder is using an arbitrary Huffman code (1) or a right-heavy Huffman code (2), but does not know the probability mass function (PMF) of the source, then a Huffman-coded file can be surprisingly difficult to cryptanalyze. However, the assumption that the cryptanalyst does not know the PMF is invalid,

(1) Whenever two subtrees are combined, an arbitrary decision may be made as to which subtree has the greater total weight (probability). (2) The subtree of greater total weight is always made the right subtree.
especially for text data, where the PMF is usually known. In this paper we assume that the cryptanalyst not only knows the construction procedure of the codes but also knows the PMF. Under this assumption, Huffman codes can be easily decoded and provide essentially no security. Fraenkel and Klein [6] have enhanced prefix codes by adding a short sequence of random bits to some of the codewords. They then show that decoding such a code is NP-complete. However, the drawback of this method is that adding random bits increases the average length of the code. A better method would be to XOR the compressed sequence with a sequence of bits obtained by encrypting a seed, as was done in [7]. This would maintain the compression properties and improve the security of the code. In this paper we consider using a variable-length prefix code for encryption. Although Huffman codes produce optimal compression, their encryption capabilities are not very good. Shannon-Fano-Elias codes [8] are a better candidate for encryption, but their compression is inferior to that of Huffman codes. In this paper we show that Shannon-Fano-Elias codes are more difficult to cryptanalyze than Huffman codes. We then propose a new code that is as good for encryption as Shannon-Fano-Elias codes but almost as good as Huffman codes for compression.

Assume that the information source puts out a sequence of symbols from the set X = {x_1, x_2, ..., x_n} with PMF {p(x_1), p(x_2), ..., p(x_n)}. This implies that Pr(X = x_i) = p(x_i), where i = 1, 2, ..., n. Each symbol in the sequence is then assigned a codeword such that the expected length of the code is small, and it is hard for an intruder to decode the sequence of codewords into the symbols being output by the source. The expected length is defined as

L = Sum_{i=1}^{n} p(x_i) l(x_i),   (1)
where l(x_i) denotes the length of the codeword for the symbol x_i with probability p(x_i). It is assumed that the intruder knows the PMF and the coding method.

In Huffman coding, the probabilities of source symbols must be ordered non-increasingly before encoding. In other words, the probabilities of the source symbols must be examined at least twice before proper codewords can be assigned: once for reordering the symbols and once for assigning the codewords. Huffman codes have an optimal expected length H <= L_Huffman < H + 1, where H represents the entropy of the source.

Shannon mentioned an approach using the cumulative distribution function when describing the Shannon-Fano code [9]. Elias later came up with a recursive implementation of this idea, now known as Shannon-Fano-Elias coding. However, Elias never published it; it was first introduced in a 1963 information theory book by Abramson [10]. We now review the construction of Shannon-Fano-Elias codes. The modified cumulative distribution function is defined as

Fbar(x_i) = Sum_{k=1}^{i-1} p(x_k) + (1/2) p(x_i).   (2)

Fbar(x_i) represents the sum of the probabilities of symbols x_1 through x_{i-1} plus half of the probability of x_i. Notice that it is not required to order the symbols based on their probabilities; therefore, different orderings of the symbols lead to different codes. Since the random variable is discrete, the cumulative distribution function consists of steps of size p(x_i). Therefore we can determine x_i if we know Fbar(x_i). In general, Fbar(x_i) is a real number expressible only by an infinite number of bits. Fbar(x_i) is rounded off to

l_SFE(x_i) = ceil( log_2 (1 / p(x_i)) ) + 1   (3)

bits, denoted by trunc(Fbar(x_i), l_SFE(x_i)). It can be shown that the set of lengths l_SFE(x_i) (1 <= i <= n) satisfies the Kraft inequality [11] and hence can be used to construct a uniquely decodable code. trunc(Fbar(x_i), l_SFE(x_i)) lies within the step corresponding to x_i. Thus we can use the first
l_SFE(x_i) bits of Fbar(x_i) to describe x_i. Shannon-Fano-Elias coding uses the binary form of trunc(Fbar(x_i), l_SFE(x_i)) as the codeword for x_i. The resulting codewords prove to be prefix-free. The expected length L_SFE computed using (1) satisfies H + 1 <= L_SFE < H + 2.

We remark that, compared to Huffman coding, in Shannon-Fano-Elias coding the probabilities of the symbols can be arranged in any order, so there exist n! different orders. This feature might be employed to increase the ambiguity of the codes and hence make the code difficult to break. For example, the probabilities of the English alphabet occurring in literature [12] are shown in the second column of Table 3 or Table 4. The corresponding codewords for the alphabet ordered from A to Z are listed in the third column of Table 3, and those for the order Z to A are listed in the third column of Table 4. For the English alphabet there exist 26! different permutations of symbols; Table 3 and Table 4 show two of the 26! cases. At a glance, one may wonder whether the codeword set resulting from the reverse order of symbols is always the complement of the codeword set for the original order. In fact, this is not true. For example, for a 3-symbol source {s_1, s_2, s_3}, let p(s_1) = p(s_2) = 0.25 and p(s_3) = 0.5. If the symbols are ordered as s_1, s_2, s_3, then the codewords are 001, 011, and 11 for s_1, s_2, and s_3, respectively. If the symbols are ordered as s_3, s_2, s_1, then the codewords are 01, 101, and 111 for s_3, s_2, and s_1, respectively.

The rest of the paper is organized as follows. Section 2 gives an algorithm for breaking a prefix code by enumeration and analyzes the complexity of breaking a Huffman code and a Shannon-Fano-Elias code. Although Shannon-Fano-Elias coding results in better security than Huffman coding, as we will see later, it increases the expected code length, making it inefficient.
In Section 3 we propose an innovative coding algorithm that achieves the same security as Shannon-Fano-Elias coding with a smaller expected code length. In Section 4 we prove some properties of the proposed code. Numerical examples are shown in Section 5. Finally, Section 6 concludes the paper.
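The Shannon-Fano-Elias construction reviewed above can be sketched in Python. This is a minimal illustration of equations (2) and (3); the function name is ours, not from the paper:

```python
from math import ceil, log2

def sfe_codewords(probs):
    """Shannon-Fano-Elias codewords for symbols taken in the given order.

    For each symbol: Fbar = (sum of earlier probabilities) + p/2,
    truncated to ceil(log2(1/p)) + 1 bits of its binary expansion.
    """
    codewords = []
    cum = 0.0
    for p in probs:
        fbar = cum + p / 2                 # modified CDF, eq. (2)
        length = ceil(log2(1 / p)) + 1     # codeword length, eq. (3)
        bits = []
        frac = fbar
        for _ in range(length):            # first `length` bits of fbar
            frac *= 2
            bit = int(frac)
            bits.append(str(bit))
            frac -= bit
        codewords.append("".join(bits))
        cum += p
    return codewords
```

Feeding the symbols in a different order yields a different prefix-free codeword set, which is exactly the ambiguity that the enumeration analysis in Section 2 exploits.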
2 Breaking a prefix code by enumeration

Suppose the cryptanalyst wants to break a prefix-coded sequence. To do this, his goal is to find the codeword set used by the encoder. Once this is done, some efficient method can perhaps be utilized to match codewords and symbols. Assume that he has worked out several candidate codeword sets by some means. Furthermore, he has received a few encoded sequences. He now needs to examine the candidate sets and find the right one.

Let S_0 be the encoded sequence to be analyzed; S_0 is one of the encoded sequences that the cryptanalyst has received. The variable S is initialized to S_0 and is progressively updated by Algorithm 1. S is a binary sequence denoted by (b_1 b_2 ...). Algorithm 1 scans S_0 and checks whether a candidate codeword set phi = {C_1, C_2, ..., C_{n-1}, C_n} could be the one used for encoding. Without loss of generality we let length(C_1) >= length(C_2) >= ... >= length(C_{n-1}) >= length(C_n), where the function length(C_i) returns the length of codeword C_i. Algorithm 1 first searches S for the longest codeword and then progressively searches for shorter and shorter codewords. Suppose the sequence C_i occurs t (t > 0) times in S. Notice that an occurrence may overlap with others. The bits to the left of each of the t occurrences are checked. If none of them match any codeword in the candidate set, then each of the t found sequences, though identical to C_i, is not actually a codeword, but either the combination of a suffix and a prefix of two codewords or an infix of a codeword; i.e., C_i is not in the codeword set used for encoding. If for some of the t occurrences the left neighboring bits match a codeword in the candidate set, then C_i is indeed in the codeword set used for encoding. The rest of the occurrences are either combinations of a suffix and a prefix of two codewords or infixes of a codeword. We then remove the occurrences of C_i that have codewords to their left.
The updated S is again searched for the next-longest codeword, i.e., C_{i+1}. The process
Algorithm 1 Check whether a candidate codeword set may be, or cannot be, the one used for encoding.

i <- 1; S <- S_0; gamma <- {}; pi <- {}
while i <= n do
    beta <- {}
    search for C_i in S from left to right; t occurrences of C_i are found
    if t = 0 then
        gamma <- gamma union {C_i}
    else
        for 1 <= k <= t do
            p_k <- index of the leftmost bit of the k-th C_i
        end for
        if p_1 = 1 then
            beta <- beta union {1}
        end if
        for 1 <= k <= t do
            if p_k > length(C_n) then
                for length(C_n) <= j <= min{length(C_1), p_k - 1} do
                    if (b_{p_k - j} b_{p_k - j + 1} ... b_{p_k - 2} b_{p_k - 1}) in phi then
                        beta <- beta union {k}
                        break
                    end if
                end for
            end if
        end for
        if beta = {} then
            gamma <- gamma union {C_i}
        else
            pi <- pi union {C_i}
            S <- S with the k-th C_i removed (for all k in beta)
        end if
    end if
    i <- i + 1
end while
if S is empty then
    return ("This codeword set may be the one used for encoding. Elements of pi are in the codeword set used for encoding. It cannot be decided whether elements of gamma are in the codeword set used for encoding.")
else
    return ("This codeword set is not the one used for encoding.")
end if
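A simplified executable sketch of the candidate-elimination idea follows. This is not a transcription of Algorithm 1, which scans from the longest codeword down and handles overlapping occurrences; instead it uses the fact that a prefix-free set parses a sequence greedily, and uniquely if it parses it at all. Function names are ours:

```python
def parses(seq, codeset):
    """True if `seq` splits exactly into codewords from the prefix-free set."""
    i = 0
    while i < len(seq):
        for cw in codeset:
            if seq.startswith(cw, i):
                i += len(cw)   # prefix-freeness: at most one codeword matches here
                break
        else:
            return False       # no codeword fits at position i
    return True

def surviving_candidates(sequences, candidates):
    """Keep only the candidate codeword sets consistent with every sequence."""
    return [c for c in candidates if all(parses(s, c) for s in sequences)]
```

As in the paper's general procedure, each additional encoded sequence can only shrink the list of surviving candidates.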
continues until all codewords in the candidate set have been checked. At the end of this process, if S is empty then the candidate set could be the set used for encoding; otherwise the candidate set is not the one used for encoding. Since the code is prefix-free, a codeword is not a prefix of any other codeword. Furthermore, a codeword must not be an infix or suffix of other codewords of equal or smaller length. This explains the principle and correctness of Algorithm 1.

To illustrate the operation of Algorithm 1, we consider an example that uses Shannon-Fano-Elias coding. Table 1 shows the source symbols and PMF. There are three symbols and thus 3! = 6 different ways in which they can be ordered. We define an order to be a key, which has been transmitted to legal receivers via private channels. Suppose one of the encoded sequences the cryptanalyst received is . First, the codeword set {0, 0, 0}, corresponding to key ABC, is examined. There does not exist 0 in the sequence. Searching for 0 we find one occurrence: ˆ (the leading bit of an occurrence is marked with a hat). The bits to the left of 0 are neither 0 nor 0, so 0 is not in the codeword set used for encoding. Finally we search for 0 and find 5 occurrences: ˆ0ˆ0ˆ0ˆ00ˆ00ˆ0ˆ00000ˆ0000ˆ0ˆ0ˆ0ˆ00ˆ00000ˆ00000ˆ0. We then remove the 0s that have a valid codeword to their left. Except for the leftmost 0, the left neighboring bits of the removed 0s are also 0. The resulting sequence then is
Table 1: An Example for Algorithm 1

Key   Symbol assignment
ABC   x_1 = A, x_2 = B, x_3 = C
ACB   x_1 = A, x_2 = C, x_3 = B
BAC   x_1 = B, x_2 = A, x_3 = C
BCA   x_1 = B, x_2 = C, x_3 = A
CBA   x_1 = C, x_2 = B, x_3 = A
CAB   x_1 = C, x_2 = A, x_3 = B
Since the process ends with a non-empty sequence, Algorithm 1 concludes that {0, 0, 0} is not the right set. Candidate sets {000, 0, 0}, {0, 00, 0}, {00, 00, 0}, and {0000, 0, 0} can be eliminated likewise. Next consider {0000, 00, 0}, corresponding to key CBA. The codeword 0000 occurs four times in the sequence: ˆ0000ˆ ˆ00000ˆ0000. Picking any one of the four, we find that the left neighboring bits are in {0000, 00, 0}. We then update S by removing all four occurrences of 0000. Searching for 00 we find 2 occurrences: ˆ0ˆ0ˆ0ˆ00ˆ00ˆ0ˆ00ˆ0ˆ0ˆ00ˆ0ˆ00. We then remove the 00s with valid codewords to their left. Except for the leftmost 00, the left neighboring bits of the removed 00s are either 00 or 0. The remaining sequence is made up of consecutive codewords 0, which are then removed. Finally an empty sequence results. Therefore {0000, 00, 0} must be the codeword set used for encoding, since all of the other five candidate sets have been eliminated. The corresponding key is CBA. The plaintext is BAAABABCCBAABCAC.

It is possible that more than one candidate set could be the one used for encoding. In this case, Algorithm 1 only reduces the size of the candidate set space but does not specify which set is the right one. Therefore, the cryptanalyst may need more than one encoded sequence to make a correct judgment. He has to check each of the candidate sets that have passed
the examination by Algorithm 1 for previous sequences. This is repeated until only one candidate set is left. For example, consider the encoded sequence 000, which can be split as shown. For this sequence, Algorithm 1 eliminates candidate sets {0, 00, 0}, {00, 00, 0}, and {0000, 00, 0}. Each of the other three sets could be the one used for encoding, because they all contain the codewords 0 and 0. The size of the candidate set space is thus reduced from 6 to 3. To find the right one, analysis of more sequences is necessary. The general process of breaking a prefix-coded sequence is:

1. Find all possible codeword sets (i.e., candidates).
2. For an encoded sequence S_0, check all candidates by repeatedly using Algorithm 1.
3. Record the candidates that may be the one used for encoding.
4. If more than one set results in step 3, pick another encoded sequence S_0 and repeat steps 2 and 3 until only one candidate is left. This is the one used for encoding.

We now narrow our attention to Huffman codes and Shannon-Fano-Elias codes. For Huffman coding, if the cryptanalyst knows the construction rule and the PMF, then there is only one candidate codeword set and no ambiguity at all. Thus Huffman coding is easy to break and is not applicable when encryption is required along with compression. For Shannon-Fano-Elias coding, even if the cryptanalyst knows the construction rule and the PMF, there are still n! different possible keys. These keys may result in at most
n! candidate codeword sets. Notice that different keys may sometimes result in the same codeword set. By Stirling's formula, n! is an exponentially large quantity. Therefore it is practically impossible to break a Shannon-Fano-Elias encoded sequence by enumeration alone. Of course, the cryptanalyst may employ cryptographic tricks to reduce the search space. But this can be countered by changing the PMF slightly while encoding, making the code virtually impossible to break: there are infinitely many possible PMFs, and each PMF can have as many as n! possible codeword sets. However, we do not address this problem here, since it is beyond the scope of this paper.

3 New coding

As mentioned in the Introduction, the expected length of Shannon-Fano-Elias codes satisfies H + 1 <= L_SFE < H + 2. We see that L_SFE is always larger than L_Huffman, which has the range H <= L_Huffman < H + 1. Our motivation in this research is to develop a new coding algorithm with smaller expected length than Shannon-Fano-Elias coding while the security level remains the same, i.e., requiring exponential time to break. In this section we propose such a code. In the next section we show that the expected length of our new code satisfies H <= L_ours < H + 2.

Take the source symbol set X = {x_1, x_2, ..., x_n} and let the probability of symbol x_i be p(x_i). The encoding is based on a binary tree, with left branches denoting 0 and right branches denoting 1. The root is at level 0, and the level of a node equals the length of the corresponding codeword. The function avail(j) denotes the number of leaves available to be chosen as codewords at level j of the tree. Before encoding starts, avail(j) is initialized to 2^j for j >= 1. Algorithm 2 gives the process of assigning a codeword to symbol x_i (1 <= i <= n), based on the sum of the probabilities p(x_1) through p(x_{i-1}) of symbols x_1 through x_{i-1}, and
Algorithm 2 Assign a codeword to symbol x_i.
Input: p(x_1), p(x_2), ..., p(x_i) such that p(x_i) > 0 and Sum_{k=1}^{i} p(x_k) <= 1.
Output: codeword(x_i).

#ofCodewordsRequired <- ceil( (1 - Sum_{k=1}^{i-1} p(x_k)) / p(x_i) )
length(x_i) <- min{ j : avail(j) >= #ofCodewordsRequired }
codeword(x_i) <- leftmost available leaf on level length(x_i)
if avail(s) mod 2 = 0 for 1 <= s <= length(x_i) then
    b <- 1
else
    b <- min{ j : avail(j) mod 2 = 1 and avail(s) mod 2 = 0 for j + 1 <= s <= length(x_i) }
end if
for b <= m <= length(x_i) do
    avail(m) <- avail(m) - 1
end for
for m > length(x_i) do
    avail(m) <- avail(m) - 2^{m - length(x_i)}
end for

the probability of symbol x_i. Algorithm 2 consists of two steps: the first step (the first three lines) assigns a codeword to x_i; the second step updates the function avail(j). In the first step, when assigning a codeword to x_i (1 <= i < n), it is guaranteed that length(x_i') >= length(x_i) for any symbol x_i' (i < i' <= n) with probability p(x_i') <= p(x_i). This strategy always reserves just enough leaves (codewords) for potential symbols (symbols that have not been encountered yet) with high probabilities, so the expected length is reduced as much as possible. To achieve this, we first evaluate the required number of codewords at the level to which x_i will belong (i.e., level length(x_i)). This number is

#ofCodewordsRequired = ceil( (1 - Sum_{k=1}^{i} p(x_k)) / p(x_i) ) + 1 = ceil( (1 - Sum_{k=1}^{i-1} p(x_k)) / p(x_i) ),   (4)

where ceil( (1 - Sum_{k=1}^{i} p(x_k)) / p(x_i) ) is the maximum number of symbols with the same or larger probability than p(x_i) that may be encountered in the future. We guarantee that they will be assigned codewords not longer than the one assigned to x_i. The + 1 means one node is needed
for x_i itself, and rounding up in (4) reserves a node at level length(x_i) for the future symbols with probabilities less than p(x_i). The descendants of this node will be codewords for those symbols. Algorithm 2 then checks the function avail(j) and finds the smallest j that satisfies avail(j) >= #ofCodewordsRequired. Any available leaf on level j can be the codeword for x_i. Without loss of generality Algorithm 2 chooses the leftmost one. In the second step, Algorithm 2 updates the function avail(j) according to the following rule: the nodes that are ancestors of the leaf used for coding x_i are no longer available, and the descendants of that leaf are no longer available either. The whole process for encoding a source is given in Figure 1.

Table 2 shows a simple example. The source contains 5 symbols. The probability p(x_1) of the first symbol is 0.20, thus #ofCodewordsRequired = 5. Since initially avail(2) = 4 < 5 and avail(3) = 8 >= 5, we assign a 3-bit codeword to x_1. According to the algorithm, the leftmost available node at level 3 of the tree, i.e., 000, is chosen as the codeword for x_1. Since 000 has been used, nodes that are prefixes of 000 (i.e., 0 and 00) and nodes that have 000 as a prefix (such as 0000 and 0001) can no longer be used as codewords. Therefore the updated function avail(j) is: avail(1) = 1 (1 can be a codeword), avail(2) = 3 (01, 10, and 11 can be codewords), avail(3) = 7 (all 3-bit sequences except 000 can be codewords), avail(4) = 14 (all 4-bit sequences except 0000 and 0001 can be codewords), etc. We continue to assign codewords for x_2 through x_5 and update avail(j) accordingly. The final codeword assignment is illustrated in Figure 2. Similar to Shannon-Fano-Elias coding, for the proposed coding technique the probabilities of symbols can be input in any order. Therefore there exist n! different permutations of symbols, making cryptanalysis by enumeration impractical.
Another advantage of the proposed method is that, since no reordering is required, the probabilities are scanned only once. This makes the computation faster than with Huffman coding. Also notice that the
Figure 1. Encoding Process. (Flowchart: initialize avail(j) = 2^j for j > 0 and i = 1; feed p(x_1), p(x_2), ..., p(x_i) to Algorithm 2; Algorithm 2 outputs the codeword for x_i; if p(x_1) + p(x_2) + ... + p(x_i) < 1, set i = i + 1 and repeat; otherwise end.)
Table 2: An Example of the Construction of the New Code

i   p(x_i)   #ofCodewordsRequired   length(x_i)   codeword   avail(j) after assignment
-   -        -                      -             -          avail(1) = 2, avail(2) = 4, avail(3) = 8, avail(4) = 16 (initial)
1   0.20     5                      3             000        avail(1) = 1, avail(2) = 3, avail(3) = 7, avail(4) = 14
2   ...      ...                    2             01         avail(1) = 1, avail(2) = 2, avail(3) = 5, avail(4) = 10
3   ...      ...                    3             001        avail(1) = 1, avail(2) = 2, avail(3) = 4, avail(4) = 8
4   ...      ...                    2             10         avail(1) = 0, avail(2) = 1, avail(3) = 2, avail(4) = 4
5   ...      ...                    2             11         avail(1) = 0, avail(2) = 0, avail(3) = 0, avail(4) = 0
Figure 2. Example for the Encoding Process. (Binary tree with 0 on left branches and 1 on right branches: x_2, x_4, and x_5 are leaves at level 2; x_1 and x_3 are leaves at level 3.)
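Since the resulting code is prefix-free (Theorem 1 below), decoding is a single greedy scan. The codebook here is the assignment as we read it from Figure 2 (x_1 = 000, x_2 = 01, x_3 = 001, x_4 = 10, x_5 = 11); treat it as an assumption, since the figure is partly illegible in our copy:

```python
def decode(bits, codebook):
    """Greedy prefix decoding: at each position at most one codeword can match."""
    inv = {cw: sym for sym, cw in codebook.items()}
    out, i = [], 0
    while i < len(bits):
        for cw, sym in inv.items():
            if bits.startswith(cw, i):
                out.append(sym)
                i += len(cw)
                break
        else:
            raise ValueError("undecodable at bit " + str(i))
    return out

# Assumed assignment read from Figure 2 (hypothetical reconstruction).
codebook = {"x1": "000", "x2": "01", "x3": "001", "x4": "10", "x5": "11"}
```

For example, decode("010001011", codebook) walks the bit string left to right, emitting one symbol per matched codeword.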
probabilities of the symbols that have occurred up to the current symbol are needed to find the codeword for the current symbol. The probabilities of the symbols that have not yet occurred are not needed. This property could be very useful if the symbol set is large.

4 Properties

In this section we first show some important properties of our new code. We then prove that the expected length satisfies the following inequality:

H <= L_ours < H + 2.   (5)

Theorem 1. The proposed codes are prefix codes.

Proof. From the construction rule we see that only leaves of the binary tree are used as codewords; internal nodes are not used as codewords.

Theorem 2. Breaking a file encoded with the proposed codes by enumeration requires exponential time.

Proof. The n source symbols can be input to the proposed encoding algorithm in any order. Therefore the encoding process can result in as many as n! different codeword sets. Since n! is an exponentially large quantity, examining all of them by exhaustive search requires exponential time.

Let length(x_i) denote the length of the codeword for symbol x_i (1 <= i <= n). Suppose we are now assigning a codeword to symbol x_i. For the previous i - 1 symbols, we have length(x_k) in {j_1, j_2, ..., j_u} for 1 <= k <= i - 1. Without loss of generality let 0 < j_1 < j_2 < ... < j_u. Suppose c_k out of the i - 1 symbols have a length of j_k, where 1 <= k <= u. Obviously i - 1 = Sum_{k=1}^{u} c_k.
Let x and y be real numbers. The following two propositions are used in the proof of Lemma 3.

1. If x < y, then ceil(x) < ceil(y) + 1.
2. If x <= y, then ceil(x) <= ceil(y).

Lemma 3. For 1 <= i <= n,

log_2 [ (1 - Sum_{k=1}^{i-1} p(x_k)) / ( p(x_i) (1 - Sum_{k=1}^{u} c_k / 2^{j_k}) ) ] <= length(x_i) <= log_2 [ (p(x_i) + 1 - Sum_{k=1}^{i} p(x_k)) / ( p(x_i) (1 - Sum_{k=1}^{u} c_k / 2^{j_k}) ) ] + 1.   (6)

Proof. By the code construction rule, before a codeword is assigned to x_i, length(x_i) satisfies

avail(length(x_i) - 1) < #ofCodewordsRequired <= avail(length(x_i)),   (7)

where #ofCodewordsRequired = ceil( (1 - Sum_{k=1}^{i-1} p(x_k)) / p(x_i) ). We also have

avail(length(x_i) - 1) = floor( avail(length(x_i)) / 2 ).   (8)

To find avail(length(x_i)) and avail(length(x_i) - 1), three cases will be discussed. In Case 1, the codeword for x_i is shorter than the shortest codeword among x_1 through x_{i-1}. In Case 2, its length lies between the shortest and the longest codeword lengths among x_1 through x_{i-1}. In Case 3, it is at least as long as the longest codeword among x_1 through x_{i-1}.

Case 1: length(x_i) < j_1. We need to compute the number of available codewords at level length(x_i) before assigning a codeword to x_i. To do this, we first find the number of nodes at level length(x_i) that cannot be used as codewords because they are either codewords themselves or ancestors of codewords at levels lower
than length(x_i). The lowest level of the tree is j_u. The c_u codewords at level j_u are equivalent to ceil( c_u / 2^{j_u - j_{u-1}} ) nodes at level j_{u-1} not being available. This number is added to c_{u-1}, the number of codewords at level j_{u-1}. Likewise, the codewords at levels j_u and j_{u-1} are together equivalent to ceil( ( ceil( c_u / 2^{j_u - j_{u-1}} ) + c_{u-1} ) / 2^{j_{u-1} - j_{u-2}} ) nodes at level j_{u-2} not being available. We continue computing in this way. Finally, we find that the i - 1 codewords for symbols x_1 through x_{i-1} are equivalent to

ceil( ( ... ceil( ( ceil( c_u / 2^{j_u - j_{u-1}} ) + c_{u-1} ) / 2^{j_{u-1} - j_{u-2}} ) + ... + c_1 ) / 2^{j_1 - length(x_i)} )   (9)

nodes at level length(x_i) not being available. Therefore

avail(length(x_i)) = 2^{length(x_i)} - ceil( ( ... ceil( ( ceil( c_u / 2^{j_u - j_{u-1}} ) + c_{u-1} ) / 2^{j_{u-1} - j_{u-2}} ) + ... + c_1 ) / 2^{j_1 - length(x_i)} ).   (10)

Solving (10), (8) and (7) we get (6).

Case 2: j_r <= length(x_i) < j_{r+1}, where 1 <= r <= u - 1. If j_k <= length(x_i) for some 1 <= k <= u, then the c_k codewords at level j_k are equivalent to c_k 2^{length(x_i) - j_k} nodes at level length(x_i) not being
available. Before assigning a codeword to x_i, we have

avail(length(x_i)) = 2^{length(x_i)} - Sum_{k=1}^{r} c_k 2^{length(x_i) - j_k} - ceil( ( ... ceil( c_u / 2^{j_u - j_{u-1}} ) + ... + c_{r+1} ) / 2^{j_{r+1} - length(x_i)} ).   (12)

Solving (12), (8) and (7) we get (6).

Case 3: length(x_i) >= j_u. Before assigning a codeword to x_i, we have

avail(length(x_i)) = 2^{length(x_i)} - Sum_{k=1}^{u} c_k 2^{length(x_i) - j_k}.   (13)

Solving (13), (8) and (7) we get (6).

Lemma 4. For 1 <= i <= n,

Sum_{k=1}^{i} p(x_k) >= Sum_{k=1}^{u} c_k / 2^{j_k}.   (14)

Proof. We use induction. For the first symbol x_1, we have

length(x_1) = ceil( log_2 (1 / p(x_1)) ).   (15)

This implies that

p(x_1) >= 1 / 2^{length(x_1)}.   (16)
Assume that for some integer I with 1 <= I < n we have

Sum_{k=1}^{I} p(x_k) >= Sum_{k=1}^{u} c_k / 2^{j_k},   (17)

where u is the number of distinct codeword lengths among x_1, ..., x_I. What will be shown is that the inequality still holds after x_{I+1} is encoded. Notice that Sum_{k=1}^{I+1} p(x_k) = Sum_{k=1}^{I} p(x_k) + p(x_{I+1}), and that encoding x_{I+1} increases the right-hand side of (17) by exactly 1 / 2^{length(x_{I+1})}. Therefore we will prove that

Sum_{k=1}^{I} p(x_k) + p(x_{I+1}) >= Sum_{k=1}^{u} c_k / 2^{j_k} + 1 / 2^{length(x_{I+1})}.   (18)

By the lower bound of Lemma 3 applied to x_{I+1},

1 / 2^{length(x_{I+1})} <= p(x_{I+1}) (1 - Sum_{k=1}^{u} c_k / 2^{j_k}) / (1 - Sum_{k=1}^{I} p(x_k)),   (20)

and therefore

p(x_{I+1}) - 1 / 2^{length(x_{I+1})} >= p(x_{I+1}) ( Sum_{k=1}^{u} c_k / 2^{j_k} - Sum_{k=1}^{I} p(x_k) ) / (1 - Sum_{k=1}^{I} p(x_k)).   (21)

By (17), the parenthesized quantity on the right-hand side of (21) is non-positive, and the factor p(x_{I+1}) / (1 - Sum_{k=1}^{I} p(x_k)) is at most 1; hence

p(x_{I+1}) - 1 / 2^{length(x_{I+1})} >= Sum_{k=1}^{u} c_k / 2^{j_k} - Sum_{k=1}^{I} p(x_k).   (22)

This implies (18). By induction we have proved (14).
Proof. It is obvious that p(x_i) <= 1 - Sum_{k=1}^{i-1} p(x_k). By Lemma 4 applied to the first i - 1 symbols, Sum_{k=1}^{i-1} p(x_k) >= Sum_{k=1}^{u} c_k / 2^{j_k}, and hence

(p(x_i) + 1 - Sum_{k=1}^{i} p(x_k)) / (1 - Sum_{k=1}^{u} c_k / 2^{j_k}) = (1 - Sum_{k=1}^{i-1} p(x_k)) / (1 - Sum_{k=1}^{u} c_k / 2^{j_k}) <= 1.   (24)

Therefore

log_2 [ (p(x_i) + 1 - Sum_{k=1}^{i} p(x_k)) / ( p(x_i) (1 - Sum_{k=1}^{u} c_k / 2^{j_k}) ) ] + 1 <= log_2 (1 / p(x_i)) + 1 <= ceil( log_2 (1 / p(x_i)) ) + 1.   (25)

By Lemma 3 the left-hand side of (25) is an upper bound on length(x_i). This proves (23).

Theorem 6. For the same source,

L_ours < L_SFE.   (27)

Proof. The length of the codeword for x_1 is given by (15). For 2 <= i <= n, by Theorem 5, length(x_i) <= ceil( log_2 (1 / p(x_i)) ) + 1. Recall that for Shannon-Fano-Elias codes the codeword length is given by (3). Therefore

L_ours = Sum_{k=1}^{n} p(x_k) length(x_k) < Sum_{k=1}^{n} p(x_k) ( ceil( log_2 (1 / p(x_k)) ) + 1 ) = L_SFE.   (29)

The inequality is strict because, by (15), length(x_1) = ceil( log_2 (1 / p(x_1)) ) < l_SFE(x_1).

Theorem 7. For a source with entropy H,

H <= L_ours < H + 2.   (30)
Proof. By the definition of source entropy, H = Sum_{k=1}^{n} p(x_k) log_2 (1 / p(x_k)), and it is straightforward from (29) that L_ours < H + 2. According to Shannon's basic theorems introduced in [9], any prefix code with expected length L must satisfy L >= H. However, not all prefix codes can achieve an expected length equal to H. For example, the lower bound for Shannon-Fano-Elias codes is H + 1 instead of H. We now show that, for our new code, if p(x_i) = 2^{-t_i} for 1 <= i <= n, where all t_i are integers such that 0 < t_1 <= t_2 <= ... <= t_n, then L_ours = H. The conditions p(x_i) = 2^{-t_i} and 0 < t_1 <= t_2 <= ... <= t_n imply that

ceil( (1 - Sum_{k=1}^{i-1} p(x_k)) / p(x_i) ) = (1 - Sum_{k=1}^{i-1} p(x_k)) / p(x_i).   (31)

According to the construction rule, if p(x_i') <= p(x_i) for i < i' <= n, then length(x_i') >= length(x_i); therefore this case belongs to Case 3 of Lemma 3. Then

2^{length(x_i) - 1} (1 - Sum_{k=1}^{u} c_k / 2^{j_k}) < (1 - Sum_{k=1}^{i-1} p(x_k)) / p(x_i) <= 2^{length(x_i)} (1 - Sum_{k=1}^{u} c_k / 2^{j_k})   (32)

and thus

length(x_i) = ceil( log_2 [ (1 - Sum_{k=1}^{i-1} p(x_k)) / ( p(x_i) (1 - Sum_{k=1}^{u} c_k / 2^{j_k}) ) ] ).   (33)

We use induction. For i = 1, according to (15) we have length(x_1) = ceil( log_2 (1 / p(x_1)) ) = t_1. Assume length(x_i) = log_2 (1 / p(x_i)) for 1 <= i <= I < n. Then p(x_i) = 2^{-length(x_i)} for 1 <= i <= I, so

Sum_{k=1}^{I} p(x_k) = Sum_{k=1}^{u} c_k / 2^{j_k}.   (34)

By (33) and (34),

length(x_{I+1}) = ceil( log_2 (1 / p(x_{I+1})) ) = log_2 (1 / p(x_{I+1})) = t_{I+1}.   (35)

By induction, length(x_i) = log_2 (1 / p(x_i)) = t_i for 1 <= i <= n; hence in this case L_ours = H. Therefore the bounds on the expected length of the proposed code are given by (30).
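The dyadic case of Theorem 7 (L_ours = H when every p(x_i) = 2^{-t_i}) can be checked numerically. The helper below recomputes codeword lengths under our assumed reading of the construction in Section 3; the function names are ours:

```python
from math import ceil, log2

def lengths(probs):
    """Codeword lengths from the tree construction (assumed reading; see Sec. 3)."""
    assigned, cum = [], 0.0
    for p in probs:
        need = ceil((1 - cum) / p)
        j = 1
        while free_leaves(assigned, j) < need:
            j += 1
        for v in range(2 ** j):            # leftmost available leaf at level j
            cand = format(v, "0" + str(j) + "b")
            if all(not cand.startswith(c) and not c.startswith(cand) for c in assigned):
                assigned.append(cand)
                break
        cum += p
    return [len(c) for c in assigned]

def free_leaves(assigned, j):
    anc = {c[:j] for c in assigned if len(c) > j}
    desc = sum(2 ** (j - len(c)) for c in assigned if len(c) <= j)
    return 2 ** j - len(anc) - desc

probs = [0.5, 0.25, 0.125, 0.125]                      # dyadic PMF, t_i = 1, 2, 3, 3
H = sum(p * log2(1 / p) for p in probs)                # source entropy
L = sum(p * l for p, l in zip(probs, lengths(probs)))  # expected code length
```

For this PMF the construction yields lengths 1, 2, 3, 3, so the expected length meets the entropy exactly, as the theorem predicts.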
Table 3: Shannon-Fano-Elias Code and Our New Code for the English Alphabet A to Z
(x_i = A, ..., Z, with columns for the probability, the SFE codewords, and the new codewords)

5 Examples

We first consider codes for the English alphabet in the order A to Z. The codeword assignments for Shannon-Fano-Elias coding and the proposed coding are shown in Table 3. In this example, the source entropy is , the expected length of the Huffman code is , the expected length of the Shannon-Fano-Elias code is , and the expected length of our proposed code is .

We now consider codes for the English alphabet in the order Z to A. The codeword assignments for Shannon-Fano-Elias coding and the proposed coding are shown in Table 4. In this case the expected length of the proposed code is .

We see that the expected length of the proposed codes is much smaller than that of Shannon-Fano-Elias codes, and is quite close to that of Huffman codes. Given that the new codes result in much better security than Huffman codes, the cost of the code length increase
Table 4: Shannon-Fano-Elias Code and Our New Code for the English Alphabet Z to A
(x_i = Z, ..., A, with columns for the probability, the SFE codewords, and the new codewords)

is reasonable.

6 Conclusion

We have developed an innovative construction of variable-length prefix codes. To construct the proposed codes, the probabilities of the source symbols do not need to be sorted, unlike in Huffman coding. We have shown that breaking a file encoded with the proposed code requires exponential time; therefore compression and encryption can be combined. This is not the case for the Huffman code, which proves to be vulnerable if the cryptanalyst knows the construction rule and the PMF. The bounds on the expected code length of the new code are derived, and it is shown that the proposed codes always have shorter expected length than Shannon-Fano-Elias codes.
We remark that using compression for data encryption might not be secure enough for applications that require a high level of secrecy. However, for some common applications, such as protecting copyrighted publications from illegal access, it is convenient to combine compression and encryption in order to make cryptanalysis as expensive as the protected materials themselves. In those applications the new coding technique is very useful.

References

[1] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proc. IRE, vol. 40, pp. 1098-1101, Sept. 1952.

[2] F. Rubin, "Cryptographic aspects of data compression codes," Cryptologia, vol. 3, no. 4, 1979.

[3] S. T. Klein, A. Bookstein, and S. Deerwester, "Storing text retrieval systems on CD-ROM: compression and encryption considerations," ACM Trans. Inform. Systems, vol. 7, no. 3, Jul. 1989.

[4] H. Lekatsas, J. Henkel, S. Chakradhar, and V. Jakkula, "Cypress: compression and encryption of data and code for embedded multimedia systems," IEEE Design and Test of Computers, vol. 21, no. 5, 2004.

[5] D. W. Gillman, M. Mohtashemi, and R. L. Rivest, "On breaking a Huffman code," IEEE Trans. Inform. Theory, vol. 42, no. 3, May 1996.

[6] A. S. Fraenkel and S. T. Klein, "Complexity aspects of guessing prefix codes," Algorithmica, vol. 12, 1994.
[7] J. Yang, L. Gao, and Y. Zhang, "Improving memory encryption performance in secure processors," IEEE Trans. Computers, vol. 54, no. 5, May 2005.

[8] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley & Sons, 1991.

[9] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379-423, 623-656, Jul.-Oct. 1948.

[10] N. Abramson, Information Theory and Coding. New York: McGraw-Hill, 1963.

[11] B. McMillan, "Two inequalities implied by unique decipherability," IRE Trans. Inform. Theory, pp. 115-116, Dec. 1956.

[12] K. Laković and J. Villasenor, "On design of error-correcting reversible variable length codes," IEEE Commun. Letters, vol. 6, no. 8, Aug. 2002.