Joint Compression and Digital Watermarking: Information-Theoretic Study and Algorithms Development


Joint Compression and Digital Watermarking: Information-Theoretic Study and Algorithms Development

by Wei Sun

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Electrical and Computer Engineering

Waterloo, Ontario, Canada, 2006

© Wei Sun 2006

AUTHOR'S DECLARATION FOR ELECTRONIC SUBMISSION OF A THESIS

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public.

Wei Sun

Abstract

In digital watermarking, a watermark is embedded into a covertext in such a way that the resulting watermarked signal is robust to certain distortion caused by either standard data processing in a friendly environment or malicious attacks in an unfriendly environment. The watermarked signal can then be used for different purposes, ranging from copyright protection and data authentication to fingerprinting and information hiding. In this thesis, digital watermarking is investigated from both an information theoretic viewpoint and a numerical computation viewpoint.

From the information theoretic viewpoint, we first study a new digital watermarking scenario, in which watermarks and covertexts are generated from a joint memoryless watermark and covertext source. The configuration of this scenario is different from that treated in existing digital watermarking works, where watermarks are assumed independent of covertexts. In the case of public watermarking, where the covertext is not accessible to the watermark decoder, a necessary and sufficient condition is determined under which the watermark can be fully recovered with high probability at the end of watermark decoding after the watermarked signal is disturbed by a fixed memoryless attack channel. Moreover, by using similar techniques, a combined source coding and Gel'fand-Pinsker channel coding theorem is established, and an open problem proposed recently by Cox et al. is solved. Interestingly, from the sufficient and necessary condition we can show that, in light of the correlation between the watermark and covertext, watermarks can still be fully recovered with high probability even if the entropy of the watermark source is strictly above the standard public watermarking capacity.

We then extend the above watermarking scenario to a case of joint compression and watermarking, where the watermark and covertext are correlated, and the watermarked signal has to be further compressed. Given an additional constraint on the compression rate of the watermarked signals, a necessary and sufficient condition is again determined under which the watermark can be fully recovered with high probability at the end of public

watermark decoding after the watermarked signal is disturbed by a fixed memoryless attack channel.

The above two joint compression and watermarking models are further investigated under a less stringent environment where the reproduced watermark at the end of decoding is allowed to be within a certain distortion of the original watermark. Sufficient conditions are determined in both cases, under which the original watermark can be reproduced with distortion less than a given distortion level after the watermarked signal is disturbed by a fixed memoryless attack channel and the covertext is not available to the watermark decoder.

Watermarking capacities and joint compression and watermarking rate regions are often characterized and/or presented as optimization problems in information theoretic research. However, this does not mean that they can be calculated easily. In this thesis we first derive closed forms of the watermarking capacities of private Laplacian watermarking systems with the magnitude-error distortion measure under a fixed additive Laplacian attack and a fixed arbitrary additive attack, respectively. Then, based on the idea of the Blahut-Arimoto algorithm for computing channel capacities and rate distortion functions, two iterative algorithms are proposed for calculating private watermarking capacities and the compression and watermarking rate regions of joint compression and private watermarking systems with finite alphabets. Finally, iterative algorithms are developed for calculating public watermarking capacities and the compression and watermarking rate regions of joint compression and public watermarking systems with finite alphabets, based on the Blahut-Arimoto algorithm and Shannon's strategy.

Acknowledgments

First and foremost, I would like to express my sincere gratitude to my supervisor, Professor En-hui Yang at the University of Waterloo. His broad knowledge, insightful understanding, encouragement, and financial support were invaluable to successfully finishing this thesis. It is truly one of the greatest experiences of my life to have been supervised by Dr. Yang, and what I learned from him will definitely benefit me in the future.

Second, I gratefully acknowledge my doctoral committee members, Professor Frans M. J. Willems of the Department of Electrical Engineering, Eindhoven University of Technology, Netherlands; Professor Jiahua Chen of the Department of Statistics and Actuarial Science, University of Waterloo; Professor Jon W. Mark of the Department of Electrical and Computer Engineering, University of Waterloo; and Professor Amir K. Khandani of the Department of Electrical and Computer Engineering, University of Waterloo, for their valuable time, helpful advice, and important comments. Thanks are also given to my comprehensive committee member, Professor G. Gong of the Department of Electrical and Computer Engineering, University of Waterloo.

Third, I am grateful to Professor Ingemar J. Cox of the Department of Electronic and Electrical Engineering, University College London, and Professor L. L. Xie of the Department of Electrical and Computer Engineering, University of Waterloo, for their valuable discussions.

Fourth, I am deeply indebted to my friends of the Multimedia Communications Laboratory at the University of Waterloo, Dr. Yunwei Jia (now with Gennum Co.), Alexei Kaltchenko (now with Wilfrid Laurier University), Dake He (now with IBM T. J. Watson Research Center), Dr. Guixing Wu, Dr. Haiquan Wang, Mr. Xiang Yu, Dr. Lusheng Chen, Mr. Xudong Ma, and many other friends, for their help and discussions.

Last, but certainly not least, I am much obliged to my wife, Yanqiu Yang, for her love, understanding, and support during the past difficult years in Canada, and to my sons, Yixiong and David, for their cuteness, laughter, and the happiness they have brought to me.

To my wife Yanqiu Yang and my sons Yixiong and David

Contents

1 Introduction
  1.1 Digital Watermarking
  1.2 Research Problems and Motivations
  1.3 Thesis Organization and Contributions
  1.4 Notations

2 On Information Embedding When Watermarks and Covertexts Are Correlated
  2.1 Basic Communication Model of Digital Watermarking
  2.2 Problem Formulation and Result Statement
    2.2.1 Problem Formulation
    2.2.2 Statement of Main Result
  2.3 Evaluation and Examples
  2.4 Application to the Gel'fand-Pinsker Channel
  2.5 Solution to Cox's Open Problem
  2.6 Proof of the Direct Part
    2.6.1 Preliminaries on Typicality
    2.6.2 Watermarking Coding Scheme
    2.6.3 Analysis of Averaged Error Probability
    2.6.4 Analysis of Distortion Constraint
    2.6.5 Existence of Watermarking Encoders and Decoders
    2.6.6 Proof of Corollary 2.1
  2.7 Proof of the Converse Part
  2.8 Summary

3 Joint Compression and Information Embedding When Watermarks and Covertexts Are Correlated
  3.1 Introduction
  3.2 Problem Formulation and Result Statement
    3.2.1 Problem Formulation
    3.2.2 Main Result and Discussion
  3.3 Properties of $R_w^{\text{correlated}}(D, R_c)$
  3.4 Proof of the Direct Part
    3.4.1 Random Joint Compression and Watermarking Coding
    3.4.2 Averaged Error Probability
    3.4.3 Distortion Constraint and Compression Rate Constraint
    3.4.4 Existence of Watermarking Encoders and Decoders
  3.5 Proof of the Converse Part
  3.6 Summary

4 Information Embedding with Fidelity Criterion for Watermarks
  4.1 Problem Formulation and Main Results
  4.2 Proof of Theorem 4.1
    4.2.1 Watermarking Coding Scheme
    4.2.2 Distortion Constraint for Watermarking Encoders
    4.2.3 Distortion Constraint for Watermark Decoders
    4.2.4 Existence of Watermarking Encoders and Decoders
  4.3 Proof of Theorem 4.2
    4.3.1 Watermarking Coding Scheme
    4.3.2 Distortion Constraint for Watermarking Encoders
    4.3.3 Compression Rate Constraint for Watermarking Encoders
    4.3.4 Averaged Distortion Constraint for Watermarks
    4.3.5 Existence of Watermarking Encoders and Decoders
  4.4 Summary

5 Closed Forms of Private Watermarking Capacities for Laplacian Sources
  5.1 Setting of Watermarking Models and Main Results
  5.2 Watermarking Capacities Under Additive Laplacian Noise Attacks
  5.3 Watermarking Capacities Under Additive Noise Attacks
  5.4 Summary

6 Algorithms for Computing Joint Compression and Private Watermarking Rate Regions
  6.1 Introduction
  6.2 Formulation of Joint Compression and Private Watermarking Rate Regions
  6.3 Algorithm A for Computing Private Watermarking Capacities
    6.3.1 Properties of $C(D)$
    6.3.2 Algorithm A
    6.3.3 Convergence of Algorithm A
  6.4 Algorithm B for Computing Compression Rate Functions
    6.4.1 Properties of Compression Rate Functions
    6.4.2 Algorithm B
    6.4.3 Convergence of Algorithm B
  6.5 Summary

7 Algorithms for Computing Joint Compression and Public Watermarking Rate Regions
  7.1 Formulation of Joint Compression and Public Watermarking Rate Regions
  7.2 Computing Public Watermarking Capacities
  7.3 Computing Compression Rate Functions
  7.4 Summary

8 Conclusions and Future Research
  8.1 Conclusions
  8.2 Directions for Future Research

Bibliography

List of Figures

2.1 Basic communication model of digital watermarking
2.2 Model of watermarking system with correlated watermarks and covertexts
2.3 The region of $(\alpha, \beta)$
2.4 Algorithm for optimal solution to Cox's problem
3.1 Model of joint compression and watermarking system
3.2 Model of joint compression and watermarking system with correlated watermarks and covertexts
5.1 Model of private Laplacian watermarking systems

Chapter 1

Introduction

1.1 Digital Watermarking

The development of the Internet has made it much easier than ever to access digital data, as audio, video, and other works become available in digital form. With an Internet connection, one can easily download and distribute perfect copies of pictures, music, and videos; with suitable software, one can also alter these copyright-protected digital media. All these activities can be carried out by would-be pirates without paying appropriate compensation to the actual copyright owners, resulting in a huge economic risk to content owners. Thus, there is a strong need for techniques to protect the copyright of content owners.

Cryptography and digital watermarking are two complementary techniques proposed so far to protect digital content. Cryptography is the processing of information into an encrypted form for the purpose of secure transmission. Before delivery, the digital content is encrypted by the owner using a secret key. A corresponding decryption key is provided only to a legitimate receiver. The encrypted content is then transmitted via the Internet or other public channels, and it is meaningless to a pirate without the decryption key. Once the encrypted content is decrypted at the receiver end, however, it has no protection anymore.

Digital watermarking, on the other hand, is a technique that can protect the digital content even after it is decrypted. In digital watermarking, a watermark is embedded into a covertext (the digital content to be protected), resulting in a watermarked signal, called the stegotext, which has no visible difference from the covertext. In a successful watermarking system, watermarks should be embedded in such a way that the watermarked signals are robust to certain distortion caused by either standard data processing in a friendly environment or malicious attacks in an unfriendly environment. In other words, watermarks can still be recovered from the attacked watermarked signal (called the forgery) generated by an attacker if the attack is not too severe. A watermarking system is called private if the covertext is available to both the watermark encoder and decoder, and public if the covertext is available only to the watermark encoder. The applications of digital watermarking are very broad, including copyright protection, information hiding, fingerprinting, etc. For a more detailed introduction to digital watermarking and its applications, please refer to [10] and [25].

1.2 Research Problems and Motivations

From an information theoretic viewpoint, a major research problem in digital watermarking is to determine the best tradeoffs among the distortion between the covertext and stegotext, the distortion between the stegotext and forgery, the watermark embedding rate, the compression rate, and the robustness of the stegotext. Along this direction, some information theoretic results, such as watermarking capacities, watermarking error exponents, and joint compression and watermarking rate regions, have been determined. Please refer to [5, 7, 19, 25, 26, 27, 32, 33] and references therein for more information theoretic results; [25] is an excellent summary of the state of the art.

The research problems to be investigated in this thesis are the following. From the viewpoint of information theory, for public digital watermarking systems

with correlated watermarks and covertexts, what is the best tradeoff among the distortion level, the compression rate, the robustness of stegotexts, and the admissibility of joint watermark and covertext sources? In other words, under what conditions can watermarks be conveyed successfully to the watermark decoder with high probability? And from the viewpoint of computation, how can watermarking capacities and the compression and watermarking rate regions of joint compression and watermarking systems be calculated efficiently?

The motivations for the above research problems are two-fold. First, in existing information-theoretic works on digital watermarking systems, the watermark to be embedded is often assumed, explicitly or implicitly, to be independent of the covertext. In some cases, however, the watermark to be embedded is correlated with the covertext; for instance, in a self-watermarking system, watermarks are extracted from covertexts by feature extraction techniques. Without utilizing this correlation, a simple scheme for embedding such a watermark is to first compress the watermark and then embed the compressed watermark into the covertext as usual. If the entropy of the watermark is less than the standard public watermarking capacity, then the watermark can be recovered with high probability after watermark decoding in the case of public watermarking. Now the question is: in light of the correlation between the watermark and covertext, can one do better in the case of public watermarking? In other words, can the watermark still be recovered with high probability if its entropy is strictly above the standard public watermarking capacity? Furthermore, in many applications, watermarked signals are stored and/or transmitted in compressed formats, and/or the reproduced watermark at the end of decoding is allowed to be within a certain distortion of the original watermark; in these scenarios, under what conditions can watermarks be conveyed to the watermark decoder with high probability of success?

Second, although watermarking capacities and the compression and watermarking rate regions of joint compression and watermarking systems can be characterized as optimization problems, the characterization does not mean that they can be calculated easily. Indeed,

solving these optimization problems is often very difficult, and closed forms of watermarking capacities and joint compression and watermarking rate regions are known only in very few cases. Therefore, it is important and necessary to develop efficient algorithms for numerically computing watermarking capacities and joint compression and watermarking rate regions.

1.3 Thesis Organization and Contributions

This thesis studies digital watermarking systems from an information theoretic viewpoint and from a computational viewpoint, respectively. From the viewpoint of information theory, we investigate a digital watermarking scenario with correlated watermarks and covertexts in Chapters 2, 3, and 4; from the viewpoint of numerical computation, we obtain closed forms of private watermarking capacities for Laplacian watermarking systems in Chapter 5 and propose iterative algorithms for numerically calculating watermarking capacities and joint compression and watermarking rate regions for private watermarking and public watermarking in Chapters 6 and 7, respectively. The organization and contributions of this thesis are summarized as follows.

In Chapter 2, from the information theoretic viewpoint we study a new digital watermarking scenario with correlated watermarks and covertexts. In the case of public watermarking, where the covertext is not accessible to the watermark decoder, a necessary and sufficient condition is determined under which the watermark can be fully recovered with high probability at the end of watermark decoding after the watermarked signal is disturbed by a fixed memoryless attack channel. Moreover, by using similar techniques, a combined source coding and Gel'fand-Pinsker channel coding theorem is established, and an open problem proposed recently by Cox et al. is solved. Interestingly, from the sufficient and necessary condition we can show that, in light of the correlation between the watermark and covertext, watermarks can still be fully recovered with high probability even if the entropy of the watermark source is strictly above the standard public watermarking

capacity.

In Chapter 3, the watermarking scenario of Chapter 2 is extended to a case of joint compression and public watermarking, where the watermark and covertext are correlated, and the watermarked signal has to be further compressed. For a given distortion level between the covertext and the watermarked signal and a given compression rate of the watermarked signal, a necessary and sufficient condition is again determined under which the watermark can be fully recovered with high probability at the end of watermark decoding after the watermarked signal is disturbed by a fixed memoryless attack channel and the covertext is not available to the watermark decoder.

The above two joint compression and watermarking models are further investigated in Chapter 4 under a less stringent environment where the reproduced watermark at the end of decoding is allowed to be within a certain distortion of the original watermark. Sufficient conditions are determined for the case without compression of watermarked signals and for the case with compression of watermarked signals, respectively, under which watermarks can be reproduced within a given distortion level with respect to the original watermarks at the end of public watermark decoding after the watermarked signals are disturbed by a fixed memoryless attack channel.

From the viewpoint of computation, Chapter 5 derives closed forms of the watermarking capacities of private Laplacian watermarking systems with the magnitude-error distortion measure under a fixed additive Laplacian attack and a fixed arbitrary additive attack, respectively. Based on the idea of the Blahut-Arimoto algorithm for computing channel capacities and rate distortion functions, two iterative algorithms are proposed in Chapter 6 which can be combined to calculate private watermarking capacities and joint compression and private watermarking rate regions. Similarly, based on both the Blahut-Arimoto algorithm and Shannon's strategy, iterative algorithms are proposed in Chapter 7 for calculating public watermarking capacities and joint compression and public watermarking rate regions.
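For orientation, the classical Blahut-Arimoto iteration on which Chapters 6 and 7 build can be sketched in a few lines. The Python below is a minimal sketch of the standard algorithm for the capacity of an ordinary discrete memoryless channel, not of the watermarking variants developed in this thesis; the function name and tolerance are illustrative.

```python
import numpy as np

def blahut_arimoto(W, tol=1e-9, max_iter=10000):
    """Capacity (in nats) of a DMC with transition matrix W[x, y] = p(y|x).

    Standard Blahut-Arimoto iteration: alternate between the induced
    output distribution q(y) and the re-weighted input distribution
    p(x) proportional to p(x) * exp(D(W(.|x) || q)).
    """
    nx, _ = W.shape
    p = np.full(nx, 1.0 / nx)              # start from the uniform input

    def divergences(p_in):
        q = p_in @ W                       # output distribution q(y)
        safe = np.where(W > 0, W, 1.0)     # avoid log(0); masked out below
        logratio = np.where(W > 0, np.log(safe / q), 0.0)
        return (W * logratio).sum(axis=1)  # D(W(.|x) || q) for each x

    for _ in range(max_iter):
        D = divergences(p)
        p_new = p * np.exp(D)
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    return float(p @ divergences(p)), p

# Sanity check: a BSC(0.01) should give log 2 - h(0.01) ~ 0.637 nats.
bsc = np.array([[0.99, 0.01], [0.01, 0.99]])
print(blahut_arimoto(bsc)[0])
```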

The last chapter, Chapter 8, concludes the thesis and outlines directions for future research.

1.4 Notations

Throughout the thesis, the following notations are adopted. We use capital letters to denote random variables, lowercase letters for their realizations, and script letters for their alphabets. For example, $S$ is a random variable over its alphabet $\mathcal{S}$, and $s \in \mathcal{S}$ is a realization. We use $p_S(s)$ to denote the probability distribution of a discrete random variable $S$ taking values over its alphabet $\mathcal{S}$, that is, $p_S(s) \stackrel{\text{def}}{=} \Pr\{S = s\}$; the same notation $p_S(s)$ is also used to denote the probability density function of a continuous random variable $S$. When there is no ambiguity, the subscript in $p_S(s)$ is omitted and we write $p_S(s)$ as $p(s)$. Similarly, $S^n = (S_1, S_2, \ldots, S_n)$ denotes a random vector taking values over $\mathcal{S}^n$, and $s^n = (s_1, s_2, \ldots, s_n)$ is a realization. Also, we always assume the attack is fixed and given by a conditional probability distribution $p(y|x)$ with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$. Notations frequently used in this thesis are summarized as follows.

List of notations

$\mathcal{M}$: Watermark alphabet
$M$, $M^n$: Watermarks
$\hat{\mathcal{M}}$: Reproduction watermark alphabet
$\hat{M}$, $\hat{M}^n$: Decoded watermarks
$\mathcal{S}$: Covertext alphabet
$S^n$: Covertexts
$\mathcal{X}$: Stegotext alphabet
$X^n$: Stegotexts
$\mathcal{Y}$: Forgery alphabet
$Y^n$: Forgeries
$p(y|x)$: An attack channel with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$
$f_n(M, S^n)$, $f_n(M^n, S^n)$: Watermark encoder
$g_n(S^n, Y^n)$: Private watermark decoder
$g_n(Y^n)$: Public watermark decoder
$p(s)$: The pmf of a covertext source $S$
$p(m, s)$: The joint pmf of a joint watermark and covertext source $(M, S)$
$d$, $d_1$: Distortion measures
$D$, $D_1$: Distortion levels
$E$: The expectation operator
$C(D)$: The watermarking capacity
$H(X)$: The entropy of $X$
$I(X; Y)$: The mutual information between $X$ and $Y$
$R_c$: The compression rate of stegotexts
$R_w$: The watermarking rate

Chapter 2

On Information Embedding When Watermarks and Covertexts Are Correlated

In this chapter, the standard model of digital watermarking is first introduced from an information theoretic viewpoint. Then, the main problem on watermarking models with correlated watermarks and covertexts is formulated, and the results of this chapter are stated. Next, by employing a similar approach, a combined source-channel coding theorem for the Gel'fand-Pinsker channel is obtained, and an open problem proposed by Cox et al. is solved. Finally, the proofs of the main results are provided.

2.1 Basic Communication Model of Digital Watermarking

From an information theoretic viewpoint, a digital watermarking system can be modeled as a communication system with side information at the watermark transmitter, as depicted in Figure 2.1. In this model, a watermark $M$ is assumed to be a random variable uniformly

taking values over $\mathcal{M} = \{1, 2, \ldots, |\mathcal{M}|\}$, and a covertext $S^n$ is a sequence of independent and identical drawings of a random variable $S$ with probability distribution $p(s)$ taking values over a finite alphabet $\mathcal{S}$.

Figure 2.1: Basic communication model of digital watermarking

If the covertext $S^n$ is available to the watermarking decoder, then the watermarking model is called private; otherwise, if the covertext $S^n$ is not available to the watermarking decoder, then the watermarking model is called public.

Let $\mathcal{X}$ be a finite alphabet, define a distortion measure $d: \mathcal{S} \times \mathcal{X} \to [0, \infty)$, and let $d_{\max} = \max_{s \in \mathcal{S}, x \in \mathcal{X}} d(s, x)$. Without loss of generality, assume that $\max_{s \in \mathcal{S}} \min_{x \in \mathcal{X}} d(s, x) = 0$. Let $\{d_n : n = 1, 2, \ldots\}$ be a single-letter fidelity criterion generated by $d$, where $d_n: \mathcal{S}^n \times \mathcal{X}^n \to [0, \infty)$ is a mapping defined by

$$d_n(s^n, x^n) = \frac{1}{n} \sum_{i=1}^{n} d(s_i, x_i)$$

for any $s^n \in \mathcal{S}^n$ and $x^n \in \mathcal{X}^n$. Without ambiguity, the subscript $n$ in $d_n$ is omitted throughout the thesis.

Let $p(y|x)$ be a conditional probability distribution with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$, and let $p(y^n|x^n) = \prod_{i=1}^{n} p(y_i|x_i)$ denote a fixed memoryless attack channel with input $x^n$ and output $y^n$.
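As a concrete instance of these definitions, the short Python sketch below (my own illustration, not from the thesis) simulates the memoryless attack channel $p(y^n|x^n) = \prod_{i=1}^{n} p(y_i|x_i)$ for a binary symmetric attack and evaluates the single-letter Hamming fidelity $d_n$; the Bernoulli(1/2) covertext and the identity embedder are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def bsc_attack(x_n, p_flip):
    """Memoryless BSC attack: each y_i = x_i flipped independently w.p. p_flip."""
    flips = rng.random(x_n.shape) < p_flip
    return np.bitwise_xor(x_n, flips.astype(x_n.dtype))

def d_n(s_n, x_n):
    """Single-letter Hamming fidelity d_n(s^n, x^n) = (1/n) sum_i d(s_i, x_i)."""
    return float(np.mean(s_n != x_n))

s = rng.integers(0, 2, size=100_000)   # Bernoulli(1/2) covertext
x = s.copy()                           # placeholder stegotext (no embedding)
y = bsc_attack(x, 0.01)                # forgery
print(d_n(s, x), d_n(x, y))            # 0.0 and approximately 0.01
```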

Definition 2.1 A watermarking encoder of length $n$ with distortion level $D$ with respect to the distortion measure $d$ is a mapping $f_n$ from $\mathcal{M} \times \mathcal{S}^n$ to $\mathcal{X}^n$ such that $Ed(S^n, X^n) \le D$, where the watermarked signal $x^n = f_n(m, s^n)$ is called the stegotext. Moreover, $R = \frac{1}{n} \log |\mathcal{M}|$ is called its watermarking rate.

Definition 2.2 A mapping $g_n: \mathcal{S}^n \times \mathcal{Y}^n \to \mathcal{M}$, $\hat{m} = g_n(s^n, y^n)$, is called a private watermarking decoder of length $n$; a mapping $g_n: \mathcal{Y}^n \to \mathcal{M}$, $\hat{m} = g_n(y^n)$, is called a public watermarking decoder of length $n$. Here, the forgery $y^n$ is generated by the attacker according to the attack channel $p(y^n|x^n)$ with input stegotext $x^n$.

Given a watermarking encoder and watermarking decoder pair $(f_n, g_n)$, the error probability of watermarking is defined by $p_e(f_n, g_n) = \Pr\{\hat{M} \neq M\}$.

Definition 2.3 A rate $R \ge 0$ is called privately (publicly) achievable with respect to distortion level $D$ if for arbitrary $\epsilon > 0$ there exists, for any sufficiently large $n$, a watermarking encoder $f_n$ with rate at least $R - \epsilon$ and distortion level $D + \epsilon$ and a private (public) watermarking decoder $g_n$ such that $p_e(f_n, g_n) < \epsilon$. The supremum of all privately (publicly) achievable rates $R$ with respect to distortion level $D$ is called the private (public) watermarking capacity of the watermarking system, denoted by $C_{\text{private}}(D)$ and $C_{\text{public}}(D)$, respectively.

From an information theoretic viewpoint, a major research problem is to determine the best tradeoffs among the distortion $D$ between covertexts and stegotexts, the watermark embedding rate $R$, and the robustness of the stegotext. Along this direction, some information theoretic results have been determined (see [25, 26] and references therein).

In existing information theoretic works on digital watermarking, the watermark to be embedded is often assumed independent of the covertext. In some cases, however, the watermark to be embedded is correlated with the covertext. For instance, there exist

self-watermarking systems in which watermarks are extracted from covertexts by feature extraction techniques; another application is to embed fingerprints into a passport's picture for the sake of security. Obviously, without utilizing this correlation, a simple scheme for embedding such a watermark is to first compress the watermark and then embed the compressed watermark into the covertext as usual (i.e., treating the compressed watermark as being independent of the covertext even though it is not). If the entropy of the watermark is less than the standard watermarking capacity, then the watermark can be recovered with high probability after watermark decoding. Now the question is: in light of the correlation between the watermark and covertext, can one do better? In other words, can the watermark still be recovered with high probability even if its entropy is strictly above the standard watermarking capacity?

In this chapter, we shall answer the above question by determining a necessary and sufficient condition under which the watermark can be recovered with high probability at the end of watermark decoding in the case of public watermarking. It turns out that the answer is affirmative: when the watermark and covertext are correlated, the watermark can indeed be recovered with high probability even when its entropy is strictly above the standard public watermarking capacity.

2.2 Problem Formulation and Result Statement

2.2.1 Problem Formulation

The model studied in this chapter is depicted in Figure 2.2. Let $\{(M_i, S_i)\}_{i=1}^{\infty}$ be a sequence of independent and identical drawings of a pair $(M, S)$ of random variables with joint probability distribution $p(m, s)$ taking values over the finite alphabet $\mathcal{M} \times \mathcal{S}$, that is, for any $n$ and $(m^n, s^n) \in \mathcal{M}^n \times \mathcal{S}^n$,

$$p(m^n, s^n) = \prod_{i=1}^{n} p(m_i, s_i).$$

Figure 2.2: Model of watermarking system with correlated watermarks and covertexts

Here, $m^n$ and $s^n$ are called a watermark and a covertext, respectively. As before, let $p(y|x)$ be a conditional probability distribution with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$, and let $p(y^n|x^n) = \prod_{i=1}^{n} p(y_i|x_i)$ denote a fixed memoryless attack channel with input $x^n$ and output $y^n$. It is assumed that the attack channel is known to both the watermark encoder and the watermark decoder.

Definition 2.4 A watermarking encoder of length $n$ with distortion level $D$ with respect to the distortion measure $d$ is a mapping $f_n$ from $\mathcal{M}^n \times \mathcal{S}^n$ to $\mathcal{X}^n$ such that $Ed(S^n, X^n) \le D$, where the watermarked signal $x^n = f_n(m^n, s^n)$.

Definition 2.5 A mapping $g_n: \mathcal{Y}^n \to \mathcal{M}^n$ with $\hat{m}^n = g_n(y^n)$ is called a public watermarking decoder of length $n$.

Given a watermarking encoder and public watermarking decoder pair $(f_n, g_n)$, the error probability of watermarking averaged over all watermarks and covertexts is defined by $p_e(f_n, g_n) = \Pr\{\hat{M}^n \neq M^n\}$.

Definition 2.6 The joint probability distribution $p(m, s)$ of a correlated watermark and covertext source $(M, S)$ is called publicly admissible with respect to distortion level $D$ if for any $\epsilon > 0$ there exists, for any sufficiently large $n$, a watermarking encoder $f_n$ with length $n$ and distortion level $D + \epsilon$ and a public watermarking decoder $g_n$ such that $p_e(f_n, g_n) < \epsilon$.

An interesting problem arises naturally: under what condition is a joint probability distribution $p(m, s)$ of a correlated watermark and covertext source $(M, S)$ publicly admissible?

It is well known [27] that the public watermarking capacity is given by

$$C_{\text{public}}(D) = \max_{p(u,x|s):\, Ed(S,X) \le D} [I(U; Y) - I(U; S)] \qquad (2.1)$$

where the maximum is taken over all auxiliary random variables $U$ and $X$ jointly distributed with $S$ and satisfying $Ed(S, X) \le D$. Obviously, if $H(M) < C_{\text{public}}(D)$, then the watermark $M$ can be recovered with high probability after watermark decoding in the case that the decoder cannot access the covertext $S^n$; this can be achieved by simply compressing $M$ at rate $H(M)$ and then embedding the compressed $M$ into $S$ using standard public watermarking schemes. In other words, $H(M) < C_{\text{public}}(D)$ is a sufficient condition for $p(m, s)$ to be publicly admissible. However, it may not be necessary. Is $H(M|S) < C_{\text{public}}(D)$ a sufficient and necessary condition, where $H(M|S)$ is the conditional entropy of $M$ given $S$? Note that even though $S^n$ is available to the watermark encoder, $S^n$ cannot be fully utilized to encode $M^n$, since $S^n$ is not available to the watermark decoder in the public watermarking system. One of our main problems in this chapter is to determine a sufficient and necessary condition for a joint probability distribution $p(m, s)$ to be publicly admissible.

It should be pointed out that in the case of private watermarking, one can ask a similar question when watermarks and covertexts are correlated. However, since the covertext $S^n$ is accessible to both the watermark encoder and decoder in this case, the solution to the corresponding question is straightforward: compressing $M$ conditionally on $S$ and then embedding the compressed watermark into $S$ by using standard private watermarking schemes provides an optimal solution.
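Before turning to the main result, note that for finite alphabets the bracketed objective in (2.1) is a direct computation for any one candidate $p(u,x|s)$. The following Python sketch is my own illustration (the helper names mutual_info and objective are assumptions, not from the thesis); obtaining $C_{\text{public}}(D)$ itself still requires maximizing this objective over all $p(u,x|s)$ satisfying $Ed(S,X) \le D$, which is what the numerical algorithms of Chapters 6 and 7 address.

```python
import numpy as np

def mutual_info(pxy):
    """I(X;Y) in nats from a joint pmf array pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px * py)[mask])).sum())

def objective(ps, p_ux_given_s, p_y_given_x):
    """Evaluate I(U;Y) - I(U;S), the bracket in (2.1), for one candidate.

    ps[s], p_ux_given_s[s, u, x], and p_y_given_x[x, y] define the joint
    p(s, u, x, y) = p(s) p(u,x|s) p(y|x).
    """
    p_suxy = (ps[:, None, None, None]
              * p_ux_given_s[..., None]
              * p_y_given_x[None, None, :, :])
    p_uy = p_suxy.sum(axis=(0, 2))    # joint pmf of (u, y)
    p_us = p_suxy.sum(axis=(2, 3)).T  # joint pmf of (u, s)
    return mutual_info(p_uy) - mutual_info(p_us)
```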

2.2.2 Statement of Main Result

As before, let $p(m, s)$ be the joint probability distribution of a fixed correlated watermark and covertext source $(M, S)$ taking values over $\mathcal{M} \times \mathcal{S}$, let $D \ge 0$ be a distortion level with respect to the distortion measure $d$, and let $p(y|x)$ be the fixed attack channel known to the watermark encoder and watermark decoder. Define

$$R_{\text{public}}^{\text{correlated}}(D) = \sup_{p(u,x|m,s):\, Ed(S,X) \le D} [I(U; Y) - I(U; M, S) + I(M; U, Y)] \qquad (2.2)$$

where the supremum is taken over all auxiliary random variables $(U, X)$ taking values over $\mathcal{U} \times \mathcal{X}$, jointly distributed with $(M, S)$ with the joint probability distribution of $(M, S, U, X, Y)$ given by $p(m, s, u, x, y) = p(m, s) p(u, x|m, s) p(y|x)$ and satisfying $Ed(S, X) \le D$; all mutual information quantities are determined by $p(m, s, u, x, y)$. It will be shown later that $U$ can be limited to $|\mathcal{U}| \le |\mathcal{M}||\mathcal{S}||\mathcal{X}| + 1$, so the sup in (2.2) can be replaced by max.

The following theorem is the main result; it gives the sufficient and necessary condition for public admissibility of a joint probability distribution $p(m, s)$.

Theorem 2.1 Let $p(m, s)$ be the fixed joint probability distribution of a watermark and covertext source pair $(M, S)$. For any $D \ge 0$, if $R_{\text{public}}^{\text{correlated}}(D) > 0$, then the following hold:

(a) $p(m, s)$ is publicly admissible with respect to $D$ if $H(M) < R_{\text{public}}^{\text{correlated}}(D)$.

(b) Conversely, $p(m, s)$ is not publicly admissible with respect to $D$ if $H(M) > R_{\text{public}}^{\text{correlated}}(D)$.

Comments:

i) The idea of the proof of Theorem 2.1 is based on the combination of the Slepian-Wolf random binning technique [31] for source coding and Gel'fand-Pinsker's random binning technique [16] for channel coding. To be specific, in the decoding part of Gel'fand-Pinsker's random binning technique, in addition to correctly decoding the transmitted message, an auxiliary vector $U^n$ correlated with the covertext $S^n$ is obtained.

This auxiliary vector $U^n$ can then be used as side information for the decoder of the Slepian-Wolf random binning coding scheme. Since the watermark is correlated with $U^n$ through the covertext $S^n$, some gain in encoding the watermark can be obtained by exploiting this correlation. Details will be described in the following sections.

ii) Since $R_{\text{public}}^{\text{correlated}}(D) > C_{\text{public}}(D)$ in general when $M$ and $S$ are highly correlated, Theorem 2.1 implies that the well-known Shannon separation theorem may not extend to the current case. Indeed, an example will be given in the next section to show that a watermark with entropy $H(M) > C_{\text{public}}(D)$ can still be transmitted reliably to the watermark receiver.

iii) It can be shown that $(U, M)$ is a better auxiliary random variable than $U$. So, as Frans Willems pointed out to me, the question whether $U = (X, M)$ is the optimal choice for the auxiliary random variable remains open. If this were the case, then a result from semantic coding (Willems and Kalker [41]) could also be used to demonstrate admissibility, and this would then lead to the condition $H(M) + I(S; X|M) < I(X; Y)$.

What happens when $H(M) = R_{\text{public}}^{\text{correlated}}(D)$? In the next section we will show that, as a function of $D$, $R_{\text{public}}^{\text{correlated}}(D)$ is concave and strictly increasing over $[0, D_{\max})$, where $D_{\max}$ is the minimum $D$ such that $R_{\text{public}}^{\text{correlated}}(D) = R_{\text{public}}^{\text{correlated}}(d_{\max})$. In view of this, we have the following stronger result:

Corollary 2.1 For any $D \in [0, D_{\max})$, if $R_{\text{public}}^{\text{correlated}}(D) > 0$, then $p(m, s)$ is publicly admissible with respect to $D$ if and only if

$$H(M) \le R_{\text{public}}^{\text{correlated}}(D). \qquad (2.3)$$

2.3 Evaluation and Examples

In this section we shall first investigate some properties of $R_{\text{public}}^{\text{correlated}}(D)$, and then present an example of a public watermarking system with correlated watermarks and covertexts to demonstrate that transmitting a watermark reliably to the watermark receiver is still possible even when the entropy $H(M)$ of the watermark is strictly above the standard public watermarking capacity $C_{\text{public}}(D)$.

Property 2.1 Let $p(m, s)$ be a fixed joint probability distribution of $(M, S)$. Then

$$R_{\text{public}}^{\text{correlated}}(D) = \max_{p(u,x|m,s):\, Ed(S,X) \le D} [I(U; Y) - I(U; M, S) + I(M; U, Y)]$$

where the maximum is taken over all auxiliary random variables $(U, X)$ taking values over $\mathcal{U} \times \mathcal{X}$ with $|\mathcal{U}| \le |\mathcal{M}||\mathcal{S}||\mathcal{X}| + 1$, jointly distributed with $(M, S)$ with the joint probability distribution of $(M, S, U, X, Y)$ given by $p(m, s, u, x, y) = p(m, s) p(u, x|m, s) p(y|x)$, and satisfying $Ed(S, X) \le D$.

Proof: The proof is standard, using Caratheodory's theorem, which can be stated as follows: each point in the convex hull of a set $A \subset \mathbb{R}^n$ is a convex combination of $n + 1$ or fewer points of $A$. Here, we follow the approach of [26]. First, we label the elements $(m, s, x)$ of $\mathcal{M} \times \mathcal{S} \times \mathcal{X}$ by $i = 1, \ldots, t = |\mathcal{M}||\mathcal{S}||\mathcal{X}|$. Let $\mathcal{P}(\mathcal{M} \times \mathcal{S} \times \mathcal{X})$ be the set of all probability distributions over $\mathcal{M} \times \mathcal{S} \times \mathcal{X}$. Define a functional

$$F: \mathcal{P}(\mathcal{M} \times \mathcal{S} \times \mathcal{X}) \to \mathbb{R}^t, \quad Q \mapsto (F_1(Q), F_2(Q), \ldots, F_t(Q)),$$

where $Q$ is a generic probability distribution over $\mathcal{M} \times \mathcal{S} \times \mathcal{X}$, and for $i = 1, 2, \ldots, t-1$,

$$F_i(Q) = Q(m, s, x) \quad \text{if } i = (m, s, x),$$

and

$$F_t(Q) = H_Q(Y) - H_Q(M, S) - I_Q(M; Y) + H_Q(M),$$

where all information quantities are determined by $Q(m, s, x) p(y|x)$.

where all information quantities are determined by Q(m, s, x)p(y x). Next, let (U, X) be any random variables taking values over U X, jointly distributed with (M, S) with the joint probability distribution p(m, s, u, x, y) = p(m, s)p(u, x m, s)p(y x) and satisfying Ed(S, X) D. Then, for each u U, p(m, s, x u) derived from p(m, s, u, x, y) is an element of P(M S X ), and the set {F (p(m, s, x u)) u U} R t. By the Caratheodory s theorem, there exist t+1 elements u i U and t+1 numbers α i 0 with i α i = 1 such that that is, t+1 p(u)f (p(m, s, x u)) = α i F (p(m, s, x u i )), u i=1 t+1 p(u)p(m, s, x u) = α i p(m, s, x u i ), (m, s, x) u i=1 H(Y U) H(M, S U) I(M; Y U) + H(M) = t+1 α i [H(Y u i ) H(M, S u i ) I(M; Y u i ) + H(M)]. i=1 Now define a new random variable U 0 {u 1, u 2,..., u t+1 } with the joint probability p(m, s, u i, x, y) = α i p(m, s, x u i )p(y x). It is easy to check that for this new random variable Ed(S, X) D and I(U; Y ) I(U; M, S) + I(M; U, Y ) = I(U 0 ; Y ) I(U 0 ; M, S) + I(M; U 0, Y ). This finished the proof of Property 2.1. Property 2.2 Rpublic correlated (D) as a function of D is concave and continuous in [0, ). Proof: First, for any random variables (M, S, U, X, Y ), we can write I(U; Y ) I(U; M, S) + I(M; U, Y ) = H(Y ) H(Y U) H(M, S) + H(M, S U) +H(M) + H(U, Y ) H(M, U, Y ) = H(Y ) H(M, S) + H(M, S U) + H(M) H(M, Y U). 18

Now for any $D_1, D_2 \ge 0$, let $(M, S, U_i, X_i, Y_i)$, $i = 1, 2$, be random variables achieving $R_{\text{public}}^{\text{correlated}}(D_i)$. For any $\lambda_1, \lambda_2 \ge 0$ with $\lambda_1 + \lambda_2 = 1$, let $T \in \{1, 2\}$ be a random variable independent of all other random variables with $\lambda_i = \Pr\{T = i\}$. Define new random variables $U = (U_T, T)$, $X = X_T$, $Y = Y_T$. Then by the construction of $(M, S, U_i, X_i, Y_i)$ it is easy to check that $Ed(S, X) \le \lambda_1 D_1 + \lambda_2 D_2$. In view of the definition of $R_{\text{public}}^{\text{correlated}}(D)$, we then have

$$R_{\text{public}}^{\text{correlated}}(\lambda_1 D_1 + \lambda_2 D_2) \ge I(U; Y) - I(U; M, S) + I(M; U, Y)$$
$$= H(Y) - H(M, S) + H(M, S|U) + H(M) - H(M, Y|U)$$
$$\ge \lambda_1 \left( H(Y_1) - H(M, S) + H(M, S|U_1) + H(M) - H(M, Y_1|U_1) \right) + \lambda_2 \left( H(Y_2) - H(M, S) + H(M, S|U_2) + H(M) - H(M, Y_2|U_2) \right)$$
$$= \lambda_1 R_{\text{public}}^{\text{correlated}}(D_1) + \lambda_2 R_{\text{public}}^{\text{correlated}}(D_2)$$

where the last inequality follows from the concavity of entropy, that is, $H(Y) \ge \lambda_1 H(Y_1) + \lambda_2 H(Y_2)$. This proves that $R_{\text{public}}^{\text{correlated}}(D)$ is concave, which in turn implies the continuity of $R_{\text{public}}^{\text{correlated}}(D)$ on $(0, \infty)$. What remains is to show that $R_{\text{public}}^{\text{correlated}}(D)$ is continuous at $D = 0$.

In view of its definition, $R_{\text{public}}^{\text{correlated}}(D)$ is clearly non-decreasing in $D$. Therefore one has

$$R_{\text{public}}^{\text{correlated}}(0) \le \lim_{n \to \infty} R_{\text{public}}^{\text{correlated}}(D_n) \qquad (2.4)$$

for $D_n \downarrow 0$. In view of Property 2.1, let $(M, S, U_n, X_n, Y_n)$, $n = 1, 2, \ldots$, denote a random vector achieving $R_{\text{public}}^{\text{correlated}}(D_n)$ and satisfying $Ed(S, X_n) \le D_n$, with $U_n$ taking values in an alphabet, say, $\mathcal{U} = \{1, 2, \ldots, |\mathcal{M}||\mathcal{S}||\mathcal{X}| + 1\}$. Consider a subsequence $\{(M, S, U_{n_i}, X_{n_i}, Y_{n_i})\}$ which converges in distribution to, say, $(M, S, U, X, Y)$. Since $D_{n_i} \to 0$, we have

$Ed(S, X) = \lim_{n_i \to \infty} Ed(S, X_{n_i}) = 0$. From the definition of $R_{\text{public}}^{\text{correlated}}(D)$, it then follows that

$$R_{\text{public}}^{\text{correlated}}(0) \ge I(U; Y) - I(U; M, S) + I(M; U, Y) = \lim_{n_i \to \infty} [I(U_{n_i}; Y_{n_i}) - I(U_{n_i}; M, S) + I(M; U_{n_i}, Y_{n_i})] = \lim_{n_i \to \infty} R_{\text{public}}^{\text{correlated}}(D_{n_i}) = \lim_{n \to \infty} R_{\text{public}}^{\text{correlated}}(D_n). \qquad (2.5)$$

Combining (2.4) and (2.5) yields the continuity of $R_{\text{public}}^{\text{correlated}}(D)$ at $D = 0$.

Property 2.3 Define $D_{\max} = \min \{ D : R_{\text{public}}^{\text{correlated}}(D) = R_{\text{public}}^{\text{correlated}}(d_{\max}) \}$. Then $R_{\text{public}}^{\text{correlated}}(D)$ as a function of $D$ is strictly increasing on $[0, D_{\max})$.

Proof: Since $R_{\text{public}}^{\text{correlated}}(D)$ is non-decreasing in $D$, the concavity of $R_{\text{public}}^{\text{correlated}}(D)$ guarantees that it is strictly increasing on $[0, D_{\max})$.

The following example shows the existence of a public watermarking system with correlated watermarks and covertexts for which transmitting watermarks $M^n$ to the watermark receiver is successful with high probability, although the entropy $H(M)$ is strictly greater than the standard public watermarking capacity $C_{\text{public}}(D)$.

Example: Assume all alphabets are binary, that is, $\mathcal{M} = \mathcal{S} = \mathcal{X} = \mathcal{Y} = \{0, 1\}$, and the covertext source $S$ is a Bernoulli source with parameter $1/2$. The distortion measure $d$ is the Hamming distance, and the attack channel $p(y|x)$ is a binary symmetric channel with crossover probability $p = 0.01$. Let $D = 0.01$. If watermarks and covertexts are independent, Moulin and O'Sullivan [26] computed its public watermarking capacity $C_{\text{public}}(D) = 0.029$

nats/channel use, and showed that the optimal random variables $U \in \{0, 1, 2\}$ and $X$ achieving the public watermarking capacity are determined by the joint probability distribution $p(s, u, x) = p(s) p_{U,X|S}(u, x|s)$, where $p_{U,X|S}(u, x|s)$ is given by

$$p_{U,X|S}(u=1, x=0 \,|\, s=0) = 0.82, \quad p_{U,X|S}(u=2, x=0 \,|\, s=0) = 0.17, \quad p_{U,X|S}(u=0, x=1 \,|\, s=0) = 0.01,$$
$$p_{U,X|S}(u=2, x=0 \,|\, s=1) = 0.01, \quad p_{U,X|S}(u=0, x=1 \,|\, s=1) = 0.17, \quad p_{U,X|S}(u=1, x=1 \,|\, s=1) = 0.82,$$

and all other conditional probabilities are zero. Now we assume that the watermark source $M$ is binary and correlated with the covertext source $S$ with a joint probability $p_{M,S}(m, s)$ given by

$$p_{M|S}(0|0) = 0.996, \quad p_{M|S}(1|1) = 0.998.$$

Let $U, X$ be the random variables as above, which are conditionally independent of $M$ given $S$. Then it is not hard to see that $M \to S \to (U, X) \to Y$ forms a Markov chain in the indicated order, and

$$I(M; U, Y) - H(M) + I(U; Y) - I(U; M, S) = I(M; U, Y) - H(M) + I(U; Y) - I(U; S) = I(M; U, Y) - H(M) + C_{\text{public}}(0.01) = 0.008 > 0,$$

which in turn implies $H(M) < R_{\text{public}}^{\text{correlated}}(D)$. By Theorem 2.1, $p(m, s)$ is publicly admissible with respect to $D = 0.01$. On the other hand, $H(M) = 0.693 > C_{\text{public}}(D) = 0.029$. Thus, we can conclude that the watermark $M$ can be transmitted reliably to the watermark decoder even though the entropy $H(M)$ is strictly above the standard public watermarking capacity.
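The capacity part of this example can be checked mechanically. The Python sketch below is my own (the indexing conventions and the helper name mi are assumptions); it builds $p(s, u, x, y) = p(s) p(u, x|s) p(y|x)$ from the tables above and evaluates $I(U;Y) - I(U;S) \approx 0.029$ nats and $H(M) \approx 0.693$ nats. Extending the joint with $p(m|s)$ allows the remaining admissibility quantity to be evaluated in the same way.

```python
import numpy as np

def mi(pxy):
    """I(X;Y) in nats from a joint pmf array pxy."""
    px = pxy.sum(1, keepdims=True)
    py = pxy.sum(0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px * py)[mask])).sum())

ps = np.array([0.5, 0.5])                      # Bernoulli(1/2) covertext
# Optimal p(u,x|s) from the example, indexed [s, u, x]
pux_s = np.zeros((2, 3, 2))
pux_s[0, 1, 0] = 0.82; pux_s[0, 2, 0] = 0.17; pux_s[0, 0, 1] = 0.01
pux_s[1, 2, 0] = 0.01; pux_s[1, 0, 1] = 0.17; pux_s[1, 1, 1] = 0.82
W = np.array([[0.99, 0.01], [0.01, 0.99]])     # BSC(0.01) attack p(y|x)

# p(s,u,x,y) = p(s) p(u,x|s) p(y|x), axes (s, u, x, y)
p = ps[:, None, None, None] * pux_s[..., None] * W[None, None, :, :]
p_uy = p.sum(axis=(0, 2))         # joint pmf of (u, y)
p_us = p.sum(axis=(2, 3)).T       # joint pmf of (u, s)
print(mi(p_uy) - mi(p_us))        # I(U;Y) - I(U;S), approximately 0.029 nats

pm = 0.5 * 0.996 + 0.5 * 0.002    # Pr{M = 0} from p(m|s) above
print(-pm*np.log(pm) - (1-pm)*np.log(1-pm))    # H(M), approximately 0.693 nats
```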

2.4 Application to the Gel'fand-Pinsker Channel

In this section we apply our techniques to the combined source and channel coding problem in which the channel is the Gel'fand-Pinsker channel and the source to be transmitted is correlated with the channel state information available only to the channel encoder, and we establish a combined source coding and Gel'fand-Pinsker channel coding theorem. An example is calculated to demonstrate the gain in information rate obtained from the correlation between the channel state source and the message source. It should be mentioned that the model considered in this section is different from that of [24], in which the message is independent of the state information of the Gel'fand-Pinsker channel, and the Gel'fand-Pinsker channel coding and the Wyner-Ziv coding are separated. As a result, the separation theorem holds for the model in [24], while it does not hold for the model in this section.

To begin with, we review the Gel'fand-Pinsker channel and the Gel'fand-Pinsker coding theorem. In their famous paper [16], Gel'fand and Pinsker studied a communication system with channel state information non-causally available only to the transmitter, and determined its channel capacity by giving a coding theorem. To be specific, let $\mathcal{K} = \{(p(y|x, s), p(s)) : y \in \mathcal{Y}, x \in \mathcal{X}, s \in \mathcal{S}\}$ be a stationary and memoryless channel with input alphabet $\mathcal{X}$, output alphabet $\mathcal{Y}$, and set of channel states $\mathcal{S}$, and let the channel state source $S$ and the message source $M$ be independent. If the state information $s^n$ is available only to the transmitter, then the channel capacity is equal to [16]

$$C_{G\text{-}P} = \max_{(U,X)} [I(U; Y) - I(U; S)],$$

where the maximum is taken over all random vectors $(U, X) \in \mathcal{U} \times \mathcal{X}$ such that the joint probability of $(U, S, X, Y)$ is given by $p(u, s, x, y) = p(s) p(u, x|s) p(y|x, s)$. Moreover, $|\mathcal{U}| \le |\mathcal{S}| + |\mathcal{X}|$.

Note that the independence between the channel state source $S$ and the message source

$M$ is assumed in the Gel'fand-Pinsker coding theorem. Now we assume the channel state source $S$ and the message source $M$ are correlated with a joint probability distribution $p(m, s)$, and the state information $S^n$ is non-causally available only to the transmitter. Define

$$R_{G\text{-}P} = \max_{(U,X)} [I(U; Y) - I(U; M, S) + I(M; U, Y)]$$

where the maximum is taken over all random variables $(U, X) \in \mathcal{U} \times \mathcal{X}$ such that the joint probability of $(M, S, U, X, Y)$ is given by $p(m, s, u, x, y) = p(m, s) p(u, x|m, s) p(y|x, s)$, and $|\mathcal{U}| \le |\mathcal{S}||\mathcal{M}| + |\mathcal{X}|$. If public admissibility of $p(m, s)$ is defined in a similar manner, then we have the following combined source coding and Gel'fand-Pinsker channel coding theorem.

Theorem 2.2 If $R_{G\text{-}P} > 0$, then the following hold:

(a) $p(m, s)$ is publicly admissible if $H(M) < R_{G\text{-}P}$.

(b) Conversely, $p(m, s)$ is not publicly admissible if $H(M) > R_{G\text{-}P}$.

The proof is similar to that of Theorem 2.1 and is omitted here. Note that this theorem is weaker than Corollary 2.1, since we do not know what happens for $p(m, s)$ if $H(M) = R_{G\text{-}P}$.

It is not hard to see that, in general, $R_{G\text{-}P} > C_{G\text{-}P}$ when $M$ and $S$ are highly correlated. In the following example, we will further show the existence of a correlated message source and channel state information source for which the message source can be transmitted

to the receiver reliably, even though $H(M)$ is strictly greater than the standard Gel'fand-Pinsker channel capacity $C_{G\text{-}P}$.

Example [16] (revisited): The channel input alphabet and output alphabet are $\mathcal{X} = \mathcal{Y} = \{0, 1\}$, and the channel state alphabet is $\mathcal{S} = \{0, 1, 2\}$. Given three parameters $0 \le \lambda, p, q \le 1/2$, the channel $\mathcal{K}$ is described as follows:

1. $p_S(0) = p_S(1) = \lambda$, $p_S(2) = 1 - 2\lambda$;

2. $p_{Y|XS}(y=0 \,|\, x=0, s=0) = p_{Y|XS}(y=0 \,|\, x=1, s=0) = 1 - q$, $p_{Y|XS}(y=0 \,|\, x=0, s=1) = p_{Y|XS}(y=0 \,|\, x=1, s=1) = q$, $p_{Y|XS}(y=0 \,|\, x=0, s=2) = 1 - p$, $p_{Y|XS}(y=1 \,|\, x=0, s=2) = p$.

Gel'fand and Pinsker obtained the capacity of $\mathcal{K}$ as

$$C_{G\text{-}P} = 1 - 2\lambda + 2\lambda h(\alpha_0) - h(\rho(\alpha_0)),$$

where

$$h(x) = -x \log_2(x) - (1-x) \log_2(1-x), \qquad (2.6)$$

$$\rho(\alpha) = 2\lambda[\alpha(1-q) + (1-\alpha)q] + (1-2\lambda)(1-p),$$

and $0 \le \alpha_0 \le 1$ is the unique solution of the equation

$$\log_2 \frac{1-\alpha}{\alpha} = (1-2q) \log_2 \frac{1-\rho(\alpha)}{\rho(\alpha)}. \qquad (2.7)$$

Now suppose the information message source $M$ is binary and correlated with the channel state information source $S$ by a joint probability distribution $p(m, s)$ given by

$$p_{M|S}(0|0) = \alpha, \quad p_{M|S}(0|1) = \beta, \quad p_{M|S}(0|2) = \gamma.$$

Figure 2.3: The region of $(\alpha, \beta)$

Let $U, X$ be the binary random variables achieving the channel capacity $C_{G\text{-}P}$, as described in [16], which are conditionally independent of $M$ given $S$. Then $M \to (S, U, X) \to Y$ also forms a Markov chain. If $(\alpha, \beta, \gamma)$ satisfies

$$H(M) - C_{G\text{-}P} > 0, \qquad I(M; U, Y) - H(M) + C_{G\text{-}P} > 0, \qquad (2.8)$$

then $H(M) < R_{G\text{-}P}$, and by Theorem 2.2 the message source $M$ can be transmitted reliably to the receiver, even though $H(M)$ is strictly greater than the Gel'fand-Pinsker channel capacity $C_{G\text{-}P}$.

Now we give some numerical solutions. Let $q = 0.2$, $p = 0.1$, $\lambda = 0.2$; in this case, $C_{G\text{-}P} = 0.2075$. Let $\gamma = 0.9$. Figure 2.3 shows that any point $(\alpha, \beta)$ in region A of the figure satisfies (2.8), where curve 1 represents $f_1(\alpha, \beta) = H(M) - C_{G\text{-}P}$ and curve 2 represents $f_2(\alpha, \beta) = I(M; U, Y) - H(M) + C_{G\text{-}P}$. For example, when $\alpha = \beta = 0.98$, we have $H(M) = 0.2484$ and $I(M; U, Y) + C_{G\text{-}P} - H(M) = 0.028 > 0$; thus, $H(M) < R_{G\text{-}P}$, which means that $M$ can be transmitted reliably to the receiver, while $H(M) > C_{G\text{-}P} = 0.2075$.
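Equation (2.7) is a scalar fixed-point condition, so $\alpha_0$ and $C_{G\text{-}P}$ can be recovered numerically by bisection. The Python sketch below is my own (variable names are illustrative); it yields $\alpha_0 \approx 0.684$ and $C_{G\text{-}P} \approx 0.2075$ bits, and evaluating $H(M)$ for $\alpha = \beta = 0.98$, $\gamma = 0.9$ gives $0.3584$ bits, which equals the $0.2484$ quoted above when expressed in nats.

```python
import numpy as np

q, p, lam = 0.2, 0.1, 0.2

def h(x):
    """Binary entropy in bits: h(x) = -x log2(x) - (1-x) log2(1-x), as in (2.6)."""
    return -x*np.log2(x) - (1 - x)*np.log2(1 - x)

def rho(a):
    return 2*lam*(a*(1 - q) + (1 - a)*q) + (1 - 2*lam)*(1 - p)

def f(a):
    """LHS minus RHS of (2.7); its sign change brackets the root alpha_0."""
    return np.log2((1 - a)/a) - (1 - 2*q)*np.log2((1 - rho(a))/rho(a))

lo, hi = 1e-9, 1 - 1e-9            # bisection: f(lo) > 0 > f(hi)
for _ in range(200):
    mid = 0.5*(lo + hi)
    lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
alpha0 = 0.5*(lo + hi)

C_GP = 1 - 2*lam + 2*lam*h(alpha0) - h(rho(alpha0))
print(alpha0, C_GP)                # alpha_0 ~ 0.684, C_GP ~ 0.2075 bits

# H(M) for alpha = beta = 0.98, gamma = 0.9:
pm0 = lam*0.98 + lam*0.98 + (1 - 2*lam)*0.9   # Pr{M = 0} = 0.932
print(h(pm0), h(pm0)*np.log(2))    # ~ 0.3584 bits = 0.2484 nats
```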

2.5 Solution to Cox's Open Problem

In [11], Cox et al. proposed an open problem on how to efficiently hide watermarks in correlated covertexts, which can be stated formally as follows. Let $\mathcal{M}, \mathcal{S}, \mathcal{X}, \mathcal{Y}$ be finite alphabets, $p(m)$ a fixed probability distribution, and

$$\mathcal{P} = \left\{ p^{(i)}(m, s) : \sum_s p^{(i)}(m, s) = p(m), \ i = 1, 2, \ldots, t \right\}$$

a finite set of joint probability distributions with the fixed marginal probability $p(m)$. Let $(M, S^{(i)})$ denote an independent and identically distributed (iid) watermark and covertext source pair generated according to the probability distribution $p^{(i)}(m, s) \in \mathcal{P}$, with $M$ serving as the watermark to be transmitted and $S^{(i)}$ serving as the covertext available only to the watermark transmitter, and let $D$ be the fixed distortion level between covertexts and stegotexts. Assume that the fixed attack channel $p(y|x)$ with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$ is memoryless, stationary, and known to both the watermark transmitter and the watermark receiver. Let $e(p^{(i)}(m, s))$ be the least number of bits of information that need to be embedded into $S^{(i)}$ in order for the watermark $M$ to be recovered with high probability after watermark decoding in the case of public watermarking, if $S^{(i)}$ is chosen as the covertext. The open problem proposed by Cox et al. in [11] can be reformulated as how to choose the optimal joint probability distribution $p^{(i_0)}(m, s)$ in $\mathcal{P}$ achieving $\min_{p^{(i)}(m,s) \in \mathcal{P}} e(p^{(i)}(m, s))$.

With the help of Theorem 2.1 and Corollary 2.1, we are ready to solve this problem; our solution is given below in Theorem 2.3. Note that in this section, public admissibility means public admissibility with respect to $D$, and to emphasize the dependence of $R_{\text{public}}^{\text{correlated}}$ on $p(m, s)$ we write $R_{\text{public}}^{\text{correlated}}(p(m, s))$ rather than $R_{\text{public}}^{\text{correlated}}(D)$.

Theorem 2.3 Let $\mathcal{P}_1$ be the set of all publicly admissible joint probability distributions $p^{(i)}(m, s) \in \mathcal{P}$, that is,

$$\mathcal{P}_1 = \{ p^{(i)}(m, s) \in \mathcal{P} : H(M) \le R_{\text{public}}^{\text{correlated}}(p^{(i)}(m, s)) \}.$$