Using an innovative coding algorithm for data encryption


Xiaoyu Ruan and Rajendra S. Katti

Abstract

This paper discusses the problem of using data compression for encryption. We first propose an algorithm for breaking a prefix-coded file by enumeration. Based on the algorithm, we analyze the complexity of breaking Huffman codes and Shannon-Fano-Elias codes, respectively, under the assumption that the cryptanalyst knows the code construction rule and the probability mass function of the source. It is shown that under this assumption Huffman codes are vulnerable, whereas Shannon-Fano-Elias codes need exponential time to break. We then propose an innovative construction of variable-length prefix codes that have the same security as Shannon-Fano-Elias codes but a smaller expected length.

Keywords: Cryptography, data compression, Huffman codes, prefix codes, Shannon-Fano-Elias codes, variable-length codes.

This work has been supported in part by the National Science Foundation under Grant CCR. The authors are with the Department of Electrical and Computer Engineering, North Dakota State University, Fargo, ND, USA (e-mail: xiaoyu.ruan@ndsu.edu; rajendra.katti@ndsu.edu).

1 Introduction

Huffman coding [1] is one of the best-known compression techniques and produces optimal compression for any given probability distribution. The use of Huffman codes for encryption has been considered in [2] and [3]. These works were motivated by security requirements for storing a large textual database on a CD-ROM. The text of the database needed to be compressed for memory efficiency and encrypted to prevent illegal use of copyrighted material. Using variable-length prefix codes for encryption does not lead to absolute secrecy as might be required by some military applications, but it makes cryptanalysis difficult enough that the cost of decryption exceeds any potential profit from breaking the code. Another application where compression and encryption are both needed is embedded multimedia systems [4]; such systems require both data and programs to be compressed, encrypted, and stored in main memory. Both data and programs are decrypted and decompressed after they enter the processor, which is considered to be secure. In this case, using compression as a means to achieve security can improve the performance and memory requirements of the system. Increased security can be obtained by using simpler encryption methods in addition to compression.

In [5], it was shown that if the cryptanalyst knows whether the encoder is using an arbitrary Huffman code (whenever two subtrees are combined, an arbitrary decision may be made as to which subtree has greater total weight, i.e., probability) or a right-heavy Huffman code (the subtree of greater total weight is always made the right subtree), but does not know the probability mass function (PMF) of the source, then a Huffman-coded file can be surprisingly difficult to cryptanalyze. However, this assumption of the cryptanalyst not knowing the PMF is invalid,

especially for text data, where the PMF is usually known. In this paper we assume that the cryptanalyst not only knows the construction procedure of the codes but also knows the PMF. Under this assumption, Huffman codes can be easily decoded and essentially provide no security.

Fraenkel and Klein [6] have enhanced prefix codes by adding a short sequence of random bits to some of the codewords. They then show that decoding such a code is NP-complete. However, the drawback of this method is that adding random bits increases the average length of the code. A better method would be to XOR the compressed sequence with a sequence of bits obtained by encrypting a seed, as was done in [7]. This would maintain the compression properties and improve the security of the code.

In this paper we consider using a variable-length prefix code for encryption. Although Huffman codes produce optimal compression, their encryption capabilities are not very good. Shannon-Fano-Elias codes [8] are a better candidate for encryption, but their compression capabilities are inferior to those of Huffman codes. In this paper we show that it is more difficult to cryptanalyze Shannon-Fano-Elias codes than Huffman codes. We then propose a new code that is as good for encryption as Shannon-Fano-Elias codes but is almost as good as Huffman codes for compression.

Assume that the information source puts out a sequence of symbols from the set X = {x_1, x_2, ..., x_n} with PMF {p(x_1), p(x_2), ..., p(x_n)}. This implies that p(X = x_i) = p(x_i), where i = 1, 2, ..., n. Each symbol in the sequence is then assigned a codeword such that the expected length of the code is small, and it is hard for an intruder to decode the sequence of codewords into the symbols being output by the source. The expected length is defined as

L = Σ_{i=1}^{n} p(x_i) l(x_i),   (1)

where l(x_i) denotes the length of the codeword for the symbol x_i with probability p(x_i). It is assumed that the intruder knows the PMF and the coding method.

In Huffman coding, the probabilities of the source symbols must be ordered non-increasingly before encoding. In other words, the probabilities of the source symbols must be examined at least twice before proper codewords can be assigned to each symbol: once for reordering the symbols and once for assigning the codewords. Huffman codes have an optimal expected length H ≤ L_Huffman < H + 1, where H represents the entropy of the source.

Shannon mentioned an approach using the cumulative distribution function when describing the Shannon-Fano code [9]. Elias later came up with a recursive implementation of this idea, now known as Shannon-Fano-Elias coding; however, Elias never published it, and it was first introduced in a 1963 information theory book by Abramson [10]. We now review the construction of Shannon-Fano-Elias codes. The modified cumulative distribution function is defined as

F̄(x_i) = Σ_{k=1}^{i−1} p(x_k) + (1/2) p(x_i).   (2)

F̄(x_i) represents the sum of the probabilities of symbols x_1 through x_{i−1} plus half of the probability of x_i. Notice that it is not required to order the symbols based on their probabilities; therefore different orderings of the symbols lead to different codes. Since the random variable is discrete, the cumulative distribution function consists of steps of size p(x_i). Therefore we can determine x_i if we know F̄(x_i). In general F̄(x_i) is a real number expressible only by an infinite number of bits, so F̄(x_i) is rounded off to

l_SFE(x_i) = ⌈log_2 (1/p(x_i))⌉ + 1   (3)

bits, denoted by ⌊F̄(x_i)⌋_{l_SFE(x_i)}. It can be shown that the set of lengths l_SFE(x_i) (1 ≤ i ≤ n) satisfies the Kraft inequality [11] and hence can be used to construct a uniquely decodable code. ⌊F̄(x_i)⌋_{l_SFE(x_i)} is within the step corresponding to x_i.

Thus we can use the first l_SFE(x_i) bits of F̄(x_i) to describe x_i. Shannon-Fano-Elias coding uses the binary form of ⌊F̄(x_i)⌋_{l_SFE(x_i)} as the codeword for x_i. The resulting codewords prove to be prefix-free. The expected length L_SFE computed using (1) satisfies H + 1 ≤ L_SFE < H + 2.

We remark that, compared to Huffman coding, in Shannon-Fano-Elias coding the probabilities of symbols can be arranged in any order, and there exist n! different orders. This feature might be employed to increase the ambiguity of the codes and hence make the code difficult to break. For example, the probabilities of the English alphabet occurring in literature [12] are shown in the second column of Table 3 and Table 4. The corresponding codewords for the alphabet ordered from A to Z are listed in the third column of Table 3, and those for the order Z to A are listed in the third column of Table 4. For the English alphabet there exist 26! different permutations of symbols; Table 3 and Table 4 show two out of the 26! cases. At a glance, one may wonder whether the resulting codeword set for the reverse order of symbols is always the complement of the codeword set for the original order. In fact, this is not true. For example, for a 3-symbol source {s_1, s_2, s_3}, let p(s_1) = p(s_2) = 0.25 and p(s_3) = 0.5. If the symbols are ordered as s_1, s_2, s_3, then the codewords are 001, 011, and 11 for s_1, s_2, and s_3, respectively. If the symbols are ordered as s_3, s_2, s_1, then the codewords are 01, 101, and 111 for s_3, s_2, and s_1, respectively.

The rest of the paper is organized as follows. Section 2 gives an algorithm for breaking a prefix code by enumeration and analyzes the complexity of breaking a Huffman code and a Shannon-Fano-Elias code. Although Shannon-Fano-Elias coding results in better security than Huffman coding, as we will see later, it increases the expected code length, thus making it inefficient. In Section 3 we propose an innovative coding algorithm that results in the same security as Shannon-Fano-Elias coding but with a smaller expected code length. In Section 4 we prove some properties of the proposed code. Numerical examples are shown in Section 5. Finally, Section 6 concludes the paper.
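The construction above is easy to mechanize. The following Python sketch is our own illustration (not code from the paper); it computes Shannon-Fano-Elias codewords for probabilities supplied in an arbitrary order and reproduces the two codeword sets of the 3-symbol example:

```python
import math

def sfe_code(probs):
    """Shannon-Fano-Elias codewords for probabilities given in an arbitrary order.

    For each symbol, take F-bar = (sum of earlier probabilities) + p/2, round it to
    l = ceil(log2(1/p)) + 1 bits, and use those bits as the codeword (Eqs. (2)-(3)).
    """
    codewords = []
    cum = 0.0
    for p in probs:
        fbar = cum + p / 2.0
        length = math.ceil(math.log2(1.0 / p)) + 1
        bits, frac = "", fbar
        for _ in range(length):          # first `length` bits of the binary expansion
            frac *= 2
            bit = int(frac)
            bits += str(bit)
            frac -= bit
        codewords.append(bits)
        cum += p
    return codewords

print(sfe_code([0.25, 0.25, 0.5]))   # ['001', '011', '11']  (order s1, s2, s3)
print(sfe_code([0.5, 0.25, 0.25]))   # ['01', '101', '111']  (order s3, s2, s1)
```

For dyadic probabilities, as in this example, the floating-point arithmetic above is exact; for arbitrary probabilities, exact rational arithmetic (e.g. fractions.Fraction) avoids rounding the codeword length the wrong way.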

2 Breaking a prefix code by enumeration

Suppose the cryptanalyst wants to break a prefix-coded sequence. In order to do this, his goal is to find out the codeword set used by the encoder. Once this is done, some efficient method can perhaps be utilized to match codewords and symbols. Assume that he has worked out several candidate codeword sets by some means. Furthermore, he has received a few encoded sequences. Now he needs to examine the candidate sets and find out the right one.

Let S_0 be the encoded sequence to be analyzed; S_0 is one of the encoded sequences that the cryptanalyst has received. The variable S is initialized to S_0 and is progressively updated by Algorithm 1. S is a binary sequence denoted by (b_1 b_2 ...). Algorithm 1 scans S_0 and checks whether a candidate codeword set φ = {C_1, C_2, ..., C_{n−1}, C_n} could be the one used for encoding. Without loss of generality we let length(C_1) ≥ length(C_2) ≥ ... ≥ length(C_{n−1}) ≥ length(C_n), where the function length(C_i) returns the length of codeword C_i.

Algorithm 1 first searches S for the longest codeword and then progressively searches for shorter and shorter codewords. Suppose the sequence C_i occurs t (t > 0) times in S; notice that an occurrence may overlap with others. The bits to the left of each of the t occurrences are checked. If none of them match any codeword in the candidate set, then each of the t found sequences, though identical to C_i, is not actually a codeword but either the combination of a suffix and a prefix of two codewords or an infix of a codeword; i.e., C_i is not in the codeword set used for encoding. If for some of the t occurrences the left neighboring bits match a codeword in the candidate set, then C_i is indeed in the codeword set used for encoding. The rest of the occurrences are either combinations of a suffix and a prefix of two codewords or infixes of a codeword. We then remove the occurrences of C_i which have codewords to their left. The updated S is again searched for the next-longest codeword, i.e., C_{i+1}.

Algorithm 1 Check whether a candidate codeword set may be the one used for encoding or cannot be the one used for encoding.

i ← 1
S ← S_0
γ ← ∅
π ← ∅
while i ≤ n do
    β ← ∅
    search for C_i in S from left to right; t occurrences of C_i are found
    if t = 0 then
        γ ← γ ∪ {C_i}
    else
        for 1 ≤ k ≤ t do
            p_k ← index of the leftmost bit of the k-th occurrence of C_i
        end for
        if p_1 = 1 then
            β ← β ∪ {1}
        end if
        for 1 ≤ k ≤ t do
            if p_k − 1 ≥ length(C_n) then
                for length(C_n) ≤ j ≤ min{length(C_i), p_k − 1} do
                    if (b_{p_k − j} b_{p_k − j + 1} ... b_{p_k − 2} b_{p_k − 1}) ∈ φ then
                        β ← β ∪ {k}
                        break
                    end if
                end for
            end if
        end for
        if β = ∅ then
            γ ← γ ∪ {C_i}
        else
            π ← π ∪ {C_i}
            S ← S with the k-th occurrence of C_i removed (for all k ∈ β)
        end if
    end if
    i ← i + 1
end while
if S is empty then
    return ("This codeword set may be the one used for encoding. Elements in π are in the codeword set used for encoding. Unable to decide if elements in γ are in the codeword set used for encoding or not.")
else
    return ("This codeword set is not the one used for encoding.")
end if
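Algorithm 1 exploits the structure of overlapping occurrences and left neighbors. A much simpler check in the same spirit, and not Algorithm 1 itself, is to test whether the intercepted bit string can be segmented into codewords of the candidate set at all. The Python sketch below (our own illustration) does this with a small dynamic program; the candidate sets are the two Shannon-Fano-Elias codeword sets computed for the 3-symbol example in the Introduction, and the intercepted string is a hypothetical encoding under the first of them:

```python
def parses(bits, codewords):
    """True if `bits` is a concatenation of strings from `codewords`.

    ok[i] records whether the first i bits can be segmented into codewords;
    a candidate set that cannot tile the whole sequence is ruled out.
    """
    ok = [True] + [False] * len(bits)
    for i in range(1, len(bits) + 1):
        for c in codewords:
            if i >= len(c) and ok[i - len(c)] and bits[i - len(c):i] == c:
                ok[i] = True
                break
    return ok[-1]

# Candidate sets from the 3-symbol example (orders s1 s2 s3 and s3 s2 s1).
candidates = {"s1 s2 s3": ["001", "011", "11"], "s3 s2 s1": ["01", "101", "111"]}
intercepted = "00111011"    # hypothetical ciphertext: s1 s3 s2 encoded with the first set
for key, cw in candidates.items():
    print(key, parses(intercepted, cw))   # True for the first set, False for the second
```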

The process continues until all codewords in the candidate set are checked. At the end of this process, if S is empty then the candidate set could be the set used for encoding; otherwise the candidate set is not the one used for encoding. Since the code is prefix-free, a codeword is not a prefix of any other codeword. Furthermore, a codeword must not be an infix or suffix of other codewords of equal or smaller length. This explains the principle and correctness of Algorithm 1.

To illustrate the operation of Algorithm 1, we consider an example that uses Shannon-Fano-Elias coding. Table 1 shows the source symbols and PMF. There are three symbols and thus 3! = 6 different ways in which they can be ordered. We define an order to be a key, which has been transmitted to legal receivers via private channels. Suppose one of the encoded sequences the cryptanalyst received is

Firstly the codeword set {0, 0, 0}, corresponding to key ABC, is examined. There does not exist 0 in the sequence. Searching for 0 we find one occurrence:

ˆ

The leading bit of the occurrence is hatted. The bits to the left of 0 are neither 0 nor 0, so 0 is not in the codeword set used for encoding. Finally we search for 0 and find 5 occurrences:

ˆ0ˆ0ˆ0ˆ00ˆ00ˆ0ˆ00000ˆ0000ˆ0ˆ0ˆ0ˆ00ˆ00000ˆ00000ˆ0.

We then remove the 0s that have a valid codeword to their left. Except for the leftmost 0, the left neighboring bits of the removed 0s are also 0. The resulting sequence then is

Table 1: An Example for Algorithm 1

Key   x_1   x_2   x_3
ABC   A     B     C
ACB   A     C     B
BAC   B     A     C
BCA   B     C     A
CBA   C     B     A
CAB   C     A     B

Since the process ends up with a non-empty sequence, Algorithm 1 concludes that {0, 0, 0} is not the right set. Candidate sets {000, 0, 0}, {0, 00, 0}, {00, 00, 0}, and {0000, 0, 0} can be eliminated likewise.

Next consider {0000, 00, 0}, corresponding to key CBA. The longest codeword occurs four times in the sequence:

ˆ0000ˆ ˆ00000ˆ0000.

Picking any one of the four, we find that the left neighboring bits are in {0000, 00, 0}. We then update S by removing all four occurrences of the longest codeword. Searching for 00 we find 2 occurrences:

ˆ0ˆ0ˆ0ˆ00ˆ00ˆ0ˆ00ˆ0ˆ0ˆ00ˆ0ˆ00.

We then remove the 00s with valid codewords to their left. Except for the leftmost 00, the left neighboring bits of the removed 00s are either 00 or 0. The remaining sequence is made up of consecutive codewords 0, which are then removed. Finally an empty sequence results. Therefore {0000, 00, 0} must be the codeword set used for encoding, since all of the other five candidate sets have been eliminated. The corresponding key is CBA, and the plaintext is BAAABABCCBAABCAC.

It is possible that more than one candidate set could be the one used for encoding. In this case, Algorithm 1 only reduces the size of the candidate-set space but does not specify which is the right one. Therefore the cryptanalyst may need more than one encoded sequence to make a correct judgement. He has to check each of the candidate sets that have passed

the examinations by Algorithm 1 for previous sequences. This is performed until only one candidate set is left. For example, consider the encoded sequence 000, which can be split into

For this sequence, Algorithm 1 eliminates candidate sets {0, 00, 0}, {00, 00, 0}, and {0000, 00, 0}. Each of the other three sets could be the one used for encoding because they all contain codewords 0 and 0. The size of the candidate-set space is reduced from 6 to 3. In order to find out the right one, analysis of more sequences is necessary.

The general process of breaking a prefix-coded sequence is:

1. Find all possible codeword sets (i.e., candidates).
2. For an encoded sequence S_0, check all candidates by repeatedly using Algorithm 1.
3. Record the candidates that may be the one used for encoding.
4. If more than one set results from step 3, then pick another encoded sequence S_0 and repeat steps 2 and 3 until only one candidate is left. This is the one used for encoding.

We now restrict our attention to Huffman codes and Shannon-Fano-Elias codes. For Huffman coding, if the cryptanalyst knows the construction rule and the PMF, then there is only one candidate codeword set; there is no ambiguity at all. Thus Huffman coding is easy to break and is not applicable when encryption is required along with compression. For Shannon-Fano-Elias coding, even if the cryptanalyst knows the construction rule and the PMF, there are still n! different possible keys. These keys may result in at most

n! candidate codeword sets. Notice that different keys may sometimes result in the same codeword set. n! is an exponential quantity according to Stirling's formula. Therefore it is practically impossible to break a Shannon-Fano-Elias encoded sequence by using only enumeration. Of course, the cryptanalyst may employ some cryptographic tricks to reduce the search space, but this can be taken care of by changing the PMF slightly while doing the encoding, making the code virtually impossible to break. This is because there are infinitely many possible PMFs and each PMF can have as many as n! possible codeword sets. However, we do not address this problem here since it is beyond the scope of this paper.

3 New coding

As mentioned in the Introduction, the expected length of Shannon-Fano-Elias codes satisfies H + 1 ≤ L_SFE < H + 2. We see that L_SFE is always larger than L_Huffman, which has a range of H ≤ L_Huffman < H + 1. The motivation of this research is to develop a new coding algorithm that has a smaller expected length than Shannon-Fano-Elias coding while the security level remains the same, i.e., breaking the code still requires exponential time. In this section we propose such a code. In the next section we show that the expected length of our new code satisfies H ≤ L_ours < H + 2.

Take the source symbol set X = {x_1, x_2, ..., x_n} and let the probability of symbol x_i be p(x_i). The encoding is based on a binary tree, with left branches denoting 0 and right branches denoting 1. The root is at level 0, and the levels of the binary tree indicate the lengths of codewords. The function avail(j) denotes the number of leaves available to be chosen as a codeword at level j of the tree. Before encoding starts, avail(j) is initialized to 2^j for j ≥ 1. Algorithm 2 gives the process of assigning a codeword to symbol x_i (1 ≤ i ≤ n), based on the sum of the i − 1 probabilities p(x_1) through p(x_{i−1}) of symbols x_1 through x_{i−1} and the probability of symbol x_i.

Algorithm 2 Assign a codeword for symbol x_i.

Input: p(x_1), p(x_2), ..., p(x_i) such that p(x_i) > 0 and Σ_{k=1}^{i} p(x_k) ≤ 1.
Output: codeword(x_i).

#ofCodewordsRequired ← ⌈(1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i)⌉
length(x_i) ← min{j : avail(j) ≥ #ofCodewordsRequired}
codeword(x_i) ← leftmost available leaf on level length(x_i)
if avail(s) mod 2 = 0 for all s with 1 ≤ s ≤ length(x_i) then
    b ← 1
else
    b ← min{j : avail(j) mod 2 = 1 and avail(s) mod 2 = 0 for all s with j + 1 ≤ s ≤ length(x_i)}
end if
for length(x_i) ≥ m ≥ b do
    avail(m) ← avail(m) − 1
end for
for m > length(x_i) do
    avail(m) ← avail(m) − 2^{m − length(x_i)}
end for

Algorithm 2 consists of two steps: the first step (the first three lines) assigns a codeword to x_i; the second step updates the function avail(j). In the first step, when assigning a codeword to x_i (1 ≤ i < n), it is guaranteed that length(x_i') ≤ length(x_i) for any later symbol x_i' (i < i' ≤ n) with probability p(x_i') ≥ p(x_i). This strategy always reserves just enough leaves (codewords) for potential symbols (symbols that have not been encountered yet) with high probabilities, so the expected length is reduced as much as possible. To achieve this, we first evaluate the required number of codewords at the level to which x_i will belong (i.e., level length(x_i)). This number is

#ofCodewordsRequired = ⌈(1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i)⌉,   (4)

which accounts for one node for x_i itself, for at most ⌊(1 − Σ_{k=1}^{i} p(x_k)) / p(x_i)⌋ symbols with probability at least p(x_i) that may be encountered in the future (we guarantee that they will be assigned codewords not longer than the one assigned to x_i), and, when such symbols do not exhaust the remaining probability, for one node at level length(x_i) that is reserved

14 for x i. The second + means one node at level length(x i ) must be reserved for the future symbols with probabilities less than. The decedents of this node will be codewords for those symbols. Algorithm 2 then checks the function avail(j) and finds the smallest j that satisfies avail(j) #of CodewordsRequired. Any leaf on level j can be the codeword for x i. Without loss of generality Algorithm 2 chooses the leftmost one. In the second step, Algorithm 2 updates function avail(j) according to the following rule: the nodes that are parents of the leaf that is used for coding x i are no longer available, and the children of the leaf that is used for coding x i are no longer available either. The whole process for encoding a source is given in Figure. Table 2 shows a simple example. The source contains 5 symbols. The probability p(x ) of the first symbol is 0.20, thus #ofcodewordsrequired = 0.20 = 5. Since initially avail(2) = 4 < 5 and avail(3) = 8 5, we will assign a 3-bit codeword to x. According to the algorithm the leftmost available node at level-3 of the tree, i.e., 000, is chosen to be the codeword for x. Since 000 has been used, nodes that are prefix of 000 (i.e., 0 and 00) and have 000 as prefix (such as 0000, 000, 000 etc.) cannot be used for codewords anymore. Therefore the updated function avail(j) should be: avail() = ( can be a codeword), avail(2) = 3 ( 0, 0, and can be codewords), avail(3) = 7 (all 3-bit sequences expect 000 can be codewords), avail(4) = 4 (all 4-bit sequences expect 0000 and 000 can be codewords) etc. We continue to assign codewords for x 2 through x 5 and update avail(j) accordingly. The final codeword assignment is illustrated in Figure 2. Similar to Shannon-Fano-Elias coding, for the proposed coding technique the probabilities of symbols can be input in any order. Therefore there exist n! different permutations of symbols making the cryptanalysis by enumeration impractical. Another advantage of the proposed method is that, since no reordering is required, the probabilities are scanned only once. This makes the computation faster than using Huffman coding. Also notice that the 4

Figure 1. Encoding Process: initialize avail(j) = 2^j for j > 0 and set i = 1; feed p(x_1), p(x_2), ..., p(x_i) to Algorithm 2, which outputs the codeword for x_i; if p(x_1) + p(x_2) + ... + p(x_i) < 1, set i = i + 1 and repeat, otherwise end.

Table 2: An Example of the Construction of the New Code

i   Length(x_i)   Codeword   avail(1)  avail(2)  avail(3)  avail(4)
-   -             -          2         4         8         16
1   3             000        1         3         7         14
2   2             01         1         2         5         10
3   3             001        1         2         4         8
4   2             10         0         1         2         4
5   2             11         0         0         0         0

(The first row of avail(j) values gives the initial state; each subsequent row gives avail(j) after the codeword for x_i has been assigned.)

Figure 2. Example for the Encoding Process (the code tree for the source of Table 2: x_2, x_4, and x_5 are leaves at level 2; x_1 and x_3 are leaves at level 3).
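A compact way to see the bookkeeping of Algorithm 2 in action is to track the free portions of the code tree directly. The following Python sketch is our own reading of Algorithm 2, not code from the paper: it represents the available part of the tree as a list of free-subtree roots, from which avail(j) and the leftmost available leaf follow immediately. The probability list is hypothetical except for p(x_1) = 0.20, which is taken from the example above; with these numbers the sketch reproduces the codeword assignment 000, 01, 001, 10, 11 of Table 2.

```python
import math

def new_code(probs):
    """Best-effort sketch of the paper's Algorithm 2.

    The free part of the code tree is kept as a list of free-subtree roots
    (bit strings); avail(j) is the number of free leaves at level j.
    """
    free = [""]                # initially the whole tree below the root is free
    codewords = []
    cum = 0.0                  # p(x_1) + ... + p(x_{i-1})
    for p in probs:
        required = math.ceil((1.0 - cum) / p)          # #ofCodewordsRequired, Eq. (4)
        j = 1                                           # smallest level with avail(j) >= required
        while sum(2 ** (j - len(r)) for r in free if len(r) <= j) < required:
            j += 1
        root = min(r for r in free if len(r) <= j)      # leftmost free subtree reaching level j
        cw = root + "0" * (j - len(root))               # leftmost available leaf on level j
        codewords.append(cw)
        # Update the free regions: the chosen leaf, its ancestors inside the free
        # subtree, and its descendants become unavailable; the sibling subtrees
        # along the path from `root` to the leaf remain free.
        free.remove(root)
        free.extend(cw[:m] + "1" for m in range(len(root), j))
        cum += p
    return codewords

# p(x_1) = 0.20 as in the paper's example; the remaining probabilities are hypothetical.
print(new_code([0.20, 0.30, 0.15, 0.20, 0.15]))   # ['000', '01', '001', '10', '11']
```

Exact rational arithmetic (fractions.Fraction) is preferable to floats when the ceiling in (4) could land exactly on an integer.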

Also notice that only the probabilities of the symbols that have occurred up to and including the current symbol are needed to find the codeword for the current symbol; the probabilities of the symbols that have not yet occurred are not needed. This property could be very useful if the symbol set is large.

4 Properties

In this section we first show some important properties of our new code. We then prove that the expected length satisfies the following inequality:

H ≤ L_ours < H + 2.   (5)

Theorem 1. The proposed codes are prefix codes.

Proof. From the construction rule we see that only leaves of the binary tree are used as codewords; internal nodes are never used, and once a leaf is chosen, its ancestors and descendants are removed from the available set. Hence no codeword can be a prefix of another.

Theorem 2. Breaking a file encoded with the proposed codes by enumeration requires exponential time.

Proof. The n source symbols can be input to the proposed encoding algorithm in any order. Therefore the encoding process can result in as many as n! different codeword sets. Since n! is an exponential quantity, examining all of them by exhaustive search requires exponential time.
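Theorem 1 can be spot-checked mechanically on any output of the new_code sketch above; the small helper below (our own illustration, not from the paper) verifies the prefix condition directly:

```python
def is_prefix_free(codewords):
    """True if no codeword is a proper prefix of another codeword."""
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

# Codewords produced for the hypothetical 5-symbol source used earlier.
print(is_prefix_free(["000", "01", "001", "10", "11"]))   # True
```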

Let length(x_i) denote the length of the codeword for symbol x_i (1 ≤ i ≤ n). Suppose we are now assigning a codeword to symbol x_i. For the previous i − 1 symbols we have length(x_k) ∈ {j_1, j_2, ..., j_u} for 1 ≤ k ≤ i − 1. Without loss of generality let 0 < j_1 < j_2 < ... < j_u. Suppose c_k out of the i − 1 symbols have a length of j_k, where 1 ≤ k ≤ u. Obviously i − 1 = c_1 + c_2 + ... + c_u.

Let x and y be real numbers. The following two propositions are used in the proof of Lemma 3.

1. If x < ⌈y⌉ then x < y + 1.
2. If ⌈x⌉ ≤ y then x ≤ y.

Lemma 3. For 1 ≤ i ≤ n,

⌈log_2 ((1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i))⌉ − u ≤ length(x_i) ≤ ⌈log_2 ((1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i) + u)⌉.   (6)

Proof. By the code construction rule, before a codeword is assigned to x_i, length(x_i) satisfies

avail(length(x_i) − 1) < #ofCodewordsRequired ≤ avail(length(x_i)),   (7)

where #ofCodewordsRequired = ⌈(1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i)⌉. We also have

avail(length(x_i) − 1) ≤ avail(length(x_i)) / 2.   (8)

To find avail(length(x_i)) and avail(length(x_i) − 1), three cases are discussed. In Case 1, the length of the codeword for x_i is smaller than the shortest codeword among x_1 through x_{i−1}. In Case 2, the length of the codeword for x_i is at least the shortest and smaller than the longest codeword among x_1 through x_{i−1}. In Case 3, the length of the codeword for x_i is at least the longest codeword among x_1 through x_{i−1}.

Case 1: length(x_i) < j_1. We need to compute the number of available codewords at level length(x_i) before assigning a codeword to x_i. In order to do this, we first find the number of nodes at level length(x_i) that cannot be used as codewords because they are either codewords themselves or ancestors of nodes representing codewords at levels lower

than length(x_i). The lowest level of the tree is j_u. The c_u codewords at level j_u are equivalent to ⌈c_u / 2^{j_u − j_{u−1}}⌉ nodes at level j_{u−1} not being available. This number is added to c_{u−1}, the number of codewords at level j_{u−1}. Likewise, the c_u codewords at level j_u and the c_{u−1} codewords at level j_{u−1} are together equivalent to ⌈(⌈c_u / 2^{j_u − j_{u−1}}⌉ + c_{u−1}) / 2^{j_{u−1} − j_{u−2}}⌉ nodes at level j_{u−2} not being available. We continue computing like this. Finally we find that the i − 1 codewords for symbols x_1 through x_{i−1} are equivalent to

⌈(⋯ ⌈(⌈c_u / 2^{j_u − j_{u−1}}⌉ + c_{u−1}) / 2^{j_{u−1} − j_{u−2}}⌉ + ⋯ + c_1) / 2^{j_1 − length(x_i)}⌉   (9)

nodes at level length(x_i) not being available. Therefore

avail(length(x_i)) = 2^{length(x_i)} − ⌈(⋯ ⌈(⌈c_u / 2^{j_u − j_{u−1}}⌉ + c_{u−1}) / 2^{j_{u−1} − j_{u−2}}⌉ + ⋯ + c_1) / 2^{j_1 − length(x_i)}⌉.   (10)

Solving (10), (8) and (7) we get (6).

Case 2: j_r ≤ length(x_i) < j_{r+1}, where 1 ≤ r ≤ u − 1. If j_k ≤ length(x_i) for 1 ≤ k ≤ r, then the c_k codewords at level j_k are equivalent to c_k 2^{length(x_i) − j_k} nodes at level length(x_i) not being

available. Before assigning a codeword to x_i, we have

avail(length(x_i)) = 2^{length(x_i)} − Σ_{k=1}^{r} c_k 2^{length(x_i) − j_k} − ⌈(⋯ ⌈(⌈c_u / 2^{j_u − j_{u−1}}⌉ + c_{u−1}) / 2^{j_{u−1} − j_{u−2}}⌉ + ⋯ + c_{r+1}) / 2^{j_{r+1} − length(x_i)}⌉.   (12)

Solving (12), (8) and (7) we get (6).

Case 3: length(x_i) ≥ j_u. Before assigning a codeword to x_i, we have

avail(length(x_i)) = 2^{length(x_i)} − Σ_{k=1}^{u} c_k 2^{length(x_i) − j_k}.   (13)

Solving (13), (8) and (7) we get (6).

Lemma 4. For 1 ≤ i ≤ n,

Σ_{k=1}^{i} p(x_k) ≥ Σ_{k=1}^{u} c_k 2^{−j_k},   (14)

where j_1 < ... < j_u are here the distinct codeword lengths among x_1, ..., x_i and c_k of these i symbols have length j_k, so that the right-hand side equals Σ_{k=1}^{i} 2^{−length(x_k)}.

Proof. We use induction. For the first symbol x_1 we have 2^{length(x_1) − 1} < ⌈1/p(x_1)⌉ ≤ 2^{length(x_1)}, and hence

length(x_1) = ⌈log_2 (1/p(x_1))⌉.   (15)

This implies that

p(x_1) ≥ 2^{−length(x_1)}.   (16)

Assume that for some integer I with 1 ≤ I < n we have

Σ_{k=1}^{I} p(x_k) ≥ Σ_{k=1}^{I} 2^{−length(x_k)}.   (17)

What will be shown is that the same inequality holds for the first I + 1 symbols. Notice that Σ_{k=1}^{I+1} p(x_k) = Σ_{k=1}^{I} p(x_k) + p(x_{I+1}), so it suffices to prove that

Σ_{k=1}^{I} p(x_k) + p(x_{I+1}) ≥ Σ_{k=1}^{I} 2^{−length(x_k)} + 2^{−length(x_{I+1})}.   (18)

By (17) it remains to show that p(x_{I+1}), together with the slack of (17), covers 2^{−length(x_{I+1})}, i.e., that 2^{−length(x_{I+1})} ≤ p(x_{I+1}) + (Σ_{k=1}^{I} p(x_k) − Σ_{k=1}^{I} 2^{−length(x_k)}) ((19)–(20)). This follows from the lower bound on length(x_{I+1}) given by Lemma 3 ((21)–(22)), which implies (18). By induction we have proved (14).

Theorem 5. For 1 ≤ i ≤ n,

length(x_i) ≤ ⌈log_2 (1/p(x_i))⌉ + 1.   (23)

Proof. It is obvious that p(x_i) ≤ 1 − Σ_{k=1}^{i−1} p(x_k). Combining this with Lemma 4, the argument of the ceiling in the upper bound of Lemma 3 satisfies

(1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i) + u ≤ 2 / p(x_i).   (24)

Therefore

⌈log_2 ((1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i) + u)⌉ ≤ ⌈log_2 (2/p(x_i))⌉ = ⌈log_2 (1/p(x_i))⌉ + 1.   (25)–(26)

By Lemma 3 the left-hand side is an upper bound on length(x_i). This proves (23).

Theorem 6. For the same source,

L_ours < L_SFE.   (27)

Proof. The length of the codeword for x_1 is given by (15). For 2 ≤ i ≤ n, by Theorem 5,

length(x_i) ≤ ⌈log_2 (1/p(x_i))⌉ + 1.   (28)

Recall that for Shannon-Fano-Elias codes the codeword length is given by (3). Therefore

L_ours = Σ_{k=1}^{n} p(x_k) length(x_k) < Σ_{k=1}^{n} p(x_k) (⌈log_2 (1/p(x_k))⌉ + 1) = L_SFE.   (29)

Theorem 7. For a source with entropy H,

H ≤ L_ours < H + 2.   (30)

Proof. By the definition of the source entropy, H = Σ_{k=1}^{n} p(x_k) log_2 (1/p(x_k)), it is straightforward from (29) that L_ours < H + 2. According to Shannon's basic theorems introduced in [9], any prefix code with expected length L must satisfy L ≥ H.

However, not all prefix codes can have an expected length equal to H. For example, the lower bound for Shannon-Fano-Elias codes is H + 1 instead of H. We now show that, for our new code, if p(x_i) = 2^{−t_i} for 1 ≤ i ≤ n, where all t_i are integers such that 0 < t_1 ≤ t_2 ≤ ... ≤ t_n, then L_ours = H. The conditions p(x_i) = 2^{−t_i} and 0 < t_1 ≤ t_2 ≤ ... ≤ t_n imply that

⌈(1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i)⌉ = (1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i).   (31)

According to the construction rule, if p(x_i') ≤ p(x_i) for i < i' ≤ n then length(x_i') ≥ length(x_i). Therefore this case belongs to Case 3 of Lemma 3. Then

2^{length(x_i) − 1} − Σ_{k=1}^{u} c_k 2^{length(x_i) − 1 − j_k} < (1 − Σ_{k=1}^{i−1} p(x_k)) / p(x_i) ≤ 2^{length(x_i)} − Σ_{k=1}^{u} c_k 2^{length(x_i) − j_k},   (32)

and thus

length(x_i) = ⌈log_2 [ (1 − Σ_{k=1}^{i−1} p(x_k)) / ( p(x_i) (1 − Σ_{k=1}^{u} c_k 2^{−j_k}) ) ]⌉.   (33)

We use induction. For i = 1, according to (15) we have length(x_1) = ⌈log_2 (1/p(x_1))⌉ = t_1. Assume length(x_i) = ⌈log_2 (1/p(x_i))⌉ = t_i for 1 ≤ i ≤ I < n. Then p(x_i) = 2^{−length(x_i)} for 1 ≤ i ≤ I, and

Σ_{k=1}^{I} p(x_k) = Σ_{k=1}^{u} c_k 2^{−j_k},   (34)

where the right-hand side is taken over the first I symbols. By (33) and (34),

length(x_{I+1}) = ⌈log_2 [ (1 − Σ_{k=1}^{I} p(x_k)) / ( p(x_{I+1}) (1 − Σ_{k=1}^{I} p(x_k)) ) ]⌉ = ⌈log_2 (1/p(x_{I+1}))⌉ = t_{I+1}.   (35)

By induction, length(x_i) = ⌈log_2 (1/p(x_i))⌉ = t_i for 1 ≤ i ≤ n. Hence in this case L_ours = H. Therefore the bounds on the expected length of the proposed code are as given by (30).
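The bounds in this section are easy to check numerically. The snippet below (our own illustration) computes the entropy and expected lengths for the hypothetical 5-symbol source used with the new_code sketch earlier; the Shannon-Fano-Elias lengths come from Eq. (3), and the new code's lengths from the codewords 000, 01, 001, 10, 11 obtained above. For comparison, an optimal Huffman code for this PMF has expected length 2.30 bits.

```python
import math

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs)

def expected_length(probs, lengths):
    return sum(p * l for p, l in zip(probs, lengths))

probs = [0.20, 0.30, 0.15, 0.20, 0.15]                           # hypothetical PMF
lengths_new = [3, 2, 3, 2, 2]                                    # from the new_code sketch
lengths_sfe = [math.ceil(math.log2(1 / p)) + 1 for p in probs]   # Eq. (3)

print(round(entropy(probs), 3))                         # 2.271
print(round(expected_length(probs, lengths_new), 3))    # 2.35  (H <= L_ours < H + 2)
print(round(expected_length(probs, lengths_sfe), 3))    # 3.7   (H + 1 <= L_SFE < H + 2)
```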

Table 3: Shannon-Fano-Elias Code and Our New Code for the English Alphabet, A to Z (columns: x_i, SFE codeword, new codeword, for A through M and for N through Z)

5 Examples

We first consider codes for the English alphabet with order A to Z. The codeword assignments for Shannon-Fano-Elias coding and the proposed coding are shown in Table 3. In this example, the source entropy is , the expected length of the Huffman code is , the expected length of the Shannon-Fano-Elias code is , and the expected length of our proposed code is .

We now consider codes for the English alphabet with order Z to A. The codeword assignments for Shannon-Fano-Elias coding and the proposed coding are shown in Table 4. In this case the expected length of the proposed code is .

We see that the expected length of the proposed codes is much smaller than that of Shannon-Fano-Elias codes, and is quite close to that of Huffman codes.

Table 4: Shannon-Fano-Elias Code and Our New Code for the English Alphabet, Z to A (columns: x_i, SFE codeword, new codeword, for Z through N and for M through A)

Given that the new codes provide much better security than Huffman codes, the cost of the increase in code length is reasonable.

6 Conclusion

We have developed an innovative construction of variable-length prefix codes. To construct the proposed codes, the probabilities of the source symbols do not need to be placed in non-increasing order, unlike Huffman coding. We have shown that breaking a file encoded using the proposed code requires exponential time; therefore compression and encryption can be combined. This is not the case for the Huffman code, which proves to be vulnerable if the cryptanalyst knows the construction rule and the PMF. The bounds on the expected code length of the new code are derived, and it is shown that the proposed codes always have a shorter expected length than Shannon-Fano-Elias codes.

We remark that using compression for data encryption might not be secure enough for applications that require a high level of secrecy. However, for some common applications, such as protecting copyrighted publications from illegal access, it is convenient to combine compression and encryption in order to make cryptanalysis as expensive as the protected materials themselves. In such applications the new coding technique is very useful.

References

[1] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proc. IRE, vol. 40, pp. 1098-1101, Sept. 1952.

[2] F. Rubin, "Cryptographic aspects of data compression codes," Cryptologia, vol. 3, no. 4, 1979.

[3] S. T. Klein, A. Bookstein, and S. Deerwester, "Storing text retrieval systems on CD-ROM: compression and encryption considerations," ACM Trans. Inform. Systems, vol. 7, no. 3, Jul. 1989.

[4] H. Lekatsas, J. Henkel, S. Chakradhar, and V. Jakkula, "Cypress: compression and encryption of data and code for embedded multimedia systems," IEEE Design and Test of Computers, vol. 21, no. 5, 2004.

[5] D. W. Gillman, M. Mohtashemi, and R. L. Rivest, "On breaking a Huffman code," IEEE Trans. Inform. Theory, vol. 42, no. 3, May 1996.

[6] A. S. Fraenkel and S. T. Klein, "Complexity aspects of guessing prefix codes," Algorithmica, vol. 12, pp. 409-419, 1994.

[7] J. Yang, L. Gao, and Y. Zhang, "Improving memory encryption performance in secure processors," IEEE Trans. Computers, vol. 54, no. 5, May 2005.

[8] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley & Sons, 1991.

[9] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379-423, 623-656, Jul.-Oct. 1948.

[10] N. Abramson, Information Theory and Coding. New York: McGraw-Hill, 1963.

[11] B. McMillan, "Two inequalities implied by unique decipherability," IRE Trans. Inform. Theory, vol. IT-2, pp. 115-116, Dec. 1956.

[12] K. Laković and J. Villasenor, "On design of error-correcting reversible variable length codes," IEEE Commun. Letters, vol. 6, no. 8, pp. 337-339, Aug. 2002.


More information

3F1: Signals and Systems INFORMATION THEORY Examples Paper Solutions

3F1: Signals and Systems INFORMATION THEORY Examples Paper Solutions Engineering Tripos Part IIA THIRD YEAR 3F: Signals and Systems INFORMATION THEORY Examples Paper Solutions. Let the joint probability mass function of two binary random variables X and Y be given in the

More information

17.1 Binary Codes Normal numbers we use are in base 10, which are called decimal numbers. Each digit can be 10 possible numbers: 0, 1, 2, 9.

17.1 Binary Codes Normal numbers we use are in base 10, which are called decimal numbers. Each digit can be 10 possible numbers: 0, 1, 2, 9. ( c ) E p s t e i n, C a r t e r, B o l l i n g e r, A u r i s p a C h a p t e r 17: I n f o r m a t i o n S c i e n c e P a g e 1 CHAPTER 17: Information Science 17.1 Binary Codes Normal numbers we use

More information

PERFECTLY secure key agreement has been studied recently

PERFECTLY secure key agreement has been studied recently IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 45, NO. 2, MARCH 1999 499 Unconditionally Secure Key Agreement the Intrinsic Conditional Information Ueli M. Maurer, Senior Member, IEEE, Stefan Wolf Abstract

More information

Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory

Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory 1 The intuitive meaning of entropy Modern information theory was born in Shannon s 1948 paper A Mathematical Theory of

More information

On the redundancy of optimum fixed-to-variable length codes

On the redundancy of optimum fixed-to-variable length codes On the redundancy of optimum fixed-to-variable length codes Peter R. Stubley' Bell-Northern Reserch Abstract There has been much interest in recent years in bounds on the redundancy of Huffman codes, given

More information

Sources: The Data Compression Book, 2 nd Ed., Mark Nelson and Jean-Loup Gailly.

Sources: The Data Compression Book, 2 nd Ed., Mark Nelson and Jean-Loup Gailly. Lossless ompression Multimedia Systems (Module 2 Lesson 2) Summary: daptive oding daptive Huffman oding Sibling Property Update lgorithm rithmetic oding oding and ecoding Issues: OF problem, Zero frequency

More information

Quantum-inspired Huffman Coding

Quantum-inspired Huffman Coding Quantum-inspired Huffman Coding A. S. Tolba, M. Z. Rashad, and M. A. El-Dosuky Dept. of Computer Science, Faculty of Computers and Information Sciences, Mansoura University, Mansoura, Egypt. tolba_954@yahoo.com,

More information

COS597D: Information Theory in Computer Science October 19, Lecture 10

COS597D: Information Theory in Computer Science October 19, Lecture 10 COS597D: Information Theory in Computer Science October 9, 20 Lecture 0 Lecturer: Mark Braverman Scribe: Andrej Risteski Kolmogorov Complexity In the previous lectures, we became acquainted with the concept

More information

COMMUNICATION SCIENCES AND ENGINEERING

COMMUNICATION SCIENCES AND ENGINEERING COMMUNICATION SCIENCES AND ENGINEERING X. PROCESSING AND TRANSMISSION OF INFORMATION Academic and Research Staff Prof. Peter Elias Prof. Robert G. Gallager Vincent Chan Samuel J. Dolinar, Jr. Richard

More information

ECE 587 / STA 563: Lecture 5 Lossless Compression

ECE 587 / STA 563: Lecture 5 Lossless Compression ECE 587 / STA 563: Lecture 5 Lossless Compression Information Theory Duke University, Fall 2017 Author: Galen Reeves Last Modified: October 18, 2017 Outline of lecture: 5.1 Introduction to Lossless Source

More information

A Mathematical Theory of Communication

A Mathematical Theory of Communication A Mathematical Theory of Communication Ben Eggers Abstract This paper defines information-theoretic entropy and proves some elementary results about it. Notably, we prove that given a few basic assumptions

More information

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Information Theory and Distribution Modeling

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Information Theory and Distribution Modeling TTIC 31230, Fundamentals of Deep Learning David McAllester, April 2017 Information Theory and Distribution Modeling Why do we model distributions and conditional distributions using the following objective

More information

Introduction to information theory and coding

Introduction to information theory and coding Introduction to information theory and coding Louis WEHENKEL Set of slides No 5 State of the art in data compression Stochastic processes and models for information sources First Shannon theorem : data

More information

The Hackenbush Number System for Compression of Numerical Data

The Hackenbush Number System for Compression of Numerical Data INFORMATION AND CONTROL 26, 134--140 (1974) The Hackenbush Number System for Compression of Numerical Data E. R. BERLEKAMP* Electronics Research Laboratory, College of Engineering, University of California,

More information

CS6304 / Analog and Digital Communication UNIT IV - SOURCE AND ERROR CONTROL CODING PART A 1. What is the use of error control coding? The main use of error control coding is to reduce the overall probability

More information

1 Ex. 1 Verify that the function H(p 1,..., p n ) = k p k log 2 p k satisfies all 8 axioms on H.

1 Ex. 1 Verify that the function H(p 1,..., p n ) = k p k log 2 p k satisfies all 8 axioms on H. Problem sheet Ex. Verify that the function H(p,..., p n ) = k p k log p k satisfies all 8 axioms on H. Ex. (Not to be handed in). looking at the notes). List as many of the 8 axioms as you can, (without

More information

Information Theory, Statistics, and Decision Trees

Information Theory, Statistics, and Decision Trees Information Theory, Statistics, and Decision Trees Léon Bottou COS 424 4/6/2010 Summary 1. Basic information theory. 2. Decision trees. 3. Information theory and statistics. Léon Bottou 2/31 COS 424 4/6/2010

More information

Entropy Coding. Connectivity coding. Entropy coding. Definitions. Lossles coder. Input: a set of symbols Output: bitstream. Idea

Entropy Coding. Connectivity coding. Entropy coding. Definitions. Lossles coder. Input: a set of symbols Output: bitstream. Idea Connectivity coding Entropy Coding dd 7, dd 6, dd 7, dd 5,... TG output... CRRRLSLECRRE Entropy coder output Connectivity data Edgebreaker output Digital Geometry Processing - Spring 8, Technion Digital

More information

( c ) E p s t e i n, C a r t e r a n d B o l l i n g e r C h a p t e r 1 7 : I n f o r m a t i o n S c i e n c e P a g e 1

( c ) E p s t e i n, C a r t e r a n d B o l l i n g e r C h a p t e r 1 7 : I n f o r m a t i o n S c i e n c e P a g e 1 ( c ) E p s t e i n, C a r t e r a n d B o l l i n g e r 2 0 1 6 C h a p t e r 1 7 : I n f o r m a t i o n S c i e n c e P a g e 1 CHAPTER 17: Information Science In this chapter, we learn how data can

More information

2018/5/3. YU Xiangyu

2018/5/3. YU Xiangyu 2018/5/3 YU Xiangyu yuxy@scut.edu.cn Entropy Huffman Code Entropy of Discrete Source Definition of entropy: If an information source X can generate n different messages x 1, x 2,, x i,, x n, then the

More information

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy Haykin_ch05_pp3.fm Page 207 Monday, November 26, 202 2:44 PM CHAPTER 5 Information Theory 5. Introduction As mentioned in Chapter and reiterated along the way, the purpose of a communication system is

More information

Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression

Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression Kirkpatrick (984) Analogy from thermodynamics. The best crystals are found by annealing. First heat up the material to let

More information

Transducers for bidirectional decoding of prefix codes

Transducers for bidirectional decoding of prefix codes Transducers for bidirectional decoding of prefix codes Laura Giambruno a,1, Sabrina Mantaci a,1 a Dipartimento di Matematica ed Applicazioni - Università di Palermo - Italy Abstract We construct a transducer

More information

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1 Kraft s inequality An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if N 2 l i 1 Proof: Suppose that we have a tree code. Let l max = max{l 1,...,

More information

ECE 587 / STA 563: Lecture 5 Lossless Compression

ECE 587 / STA 563: Lecture 5 Lossless Compression ECE 587 / STA 563: Lecture 5 Lossless Compression Information Theory Duke University, Fall 28 Author: Galen Reeves Last Modified: September 27, 28 Outline of lecture: 5. Introduction to Lossless Source

More information