Lecture 15: Conditional and Joint Typicality


EE376A: Information Theory    Lecture 15, 02/26/2015
Lecturer: Kartik Venkat    Scribes: Max Zimet, Brian Wai, Sepehr Nezami

1 Notation

We always write an individual sequence of symbols with lowercase letters: for example, x^n denotes a specific sequence with no probability distribution attached to it. We use capital letters for random variables, e.g. X^n drawn i.i.d. according to some distribution. Throughout, X will denote the set of possible values a symbol can take.

2 Strong δ-typical sets

Before going forward, we define the empirical distribution.

Definition 1 (Empirical distribution). For any sequence x^n, the empirical distribution is the probability distribution on the alphabet given by the frequency with which each letter appears in the sequence. More precisely,

    p_{x^n}(a) = (1/n) Σ_{i=1}^{n} 1{x_i = a},   a ∈ X.    (1)

2.1 Typical Set (again)

A rough intuition for typical sets: if one draws a sequence X^n i.i.d. according to a distribution p_X, then the typical set T(X) is a set of length-n sequences with the following properties:

1. A sequence chosen at random lies in the typical set with probability close to one.
2. All elements of the typical set have (almost) equal probability.

Figure 1: The space of all length-n sequences, with the typical set inside.

More precisely, if T(X) is the typical set, then |T(X)| ≈ 2^{nH(X)} and the probability of each sequence inside the typical set is ≈ 2^{-nH(X)}. So a random sequence drawn from the source looks like one chosen uniformly from the typical set.
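To make Definition 1 concrete, here is a minimal Python sketch (our own illustration, not part of the original notes; the function name and the example sequence are made up) that computes the empirical distribution of a sequence:

```python
from collections import Counter

def empirical_distribution(x, alphabet):
    """Return p_{x^n}(a) = (1/n) * #{i : x_i = a} for each letter a of the alphabet."""
    n = len(x)
    counts = Counter(x)
    return {a: counts.get(a, 0) / n for a in alphabet}

# Example: a length-10 sequence over the alphabet {a, b, c}.
x = list("abbabbbbcb")
p_hat = empirical_distribution(x, alphabet=["a", "b", "c"])
print(p_hat)   # {'a': 0.2, 'b': 0.7, 'c': 0.1}
assert abs(sum(p_hat.values()) - 1.0) < 1e-12
```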

3 δ-strongly typical set (recap from last lecture)

Definition 2. A sequence x^n is δ-typical with respect to a pmf p if

    |p_{x^n}(a) - p(a)| ≤ δ·p(a)   for all a ∈ X,    (2)

where X is the support of the distribution.

Note that this strongly typical set is different from the (weakly) typical set defined in the first lectures. The new notion is stronger, but, as we will see, it retains all the desirable properties of typicality, and more. The following example illustrates the difference between the strong and weak notions.

Example 3. Suppose the alphabet is X = {a, b, c} with probabilities p(a) = 0.1, p(b) = 0.8, and p(c) = 0.1. Consider two strings of length 1000:

    x^1000_strong = (100 a's, 800 b's, 100 c's),
    x^1000_weak   = (200 a's, 800 b's).

These two sequences have the same probability, so both belong to the weakly typical set A_ε^{(1000)} (for some ε). But it is not hard to see that x^1000_strong is δ-strongly typical while x^1000_weak is not (for sufficiently small δ). The strong notion is sensitive to the frequency of each individual letter, not only to the total probability of the sequence.

It was shown in Homework 6 that the important properties of typical sets carry over to the new strong notion. We have the following results from the homework:

1. T_δ(X) ⊆ A_ε(X), i.e. strongly typical sequences are also weakly typical.
2. The empirical distribution p_{x^n} is almost equal to the true distribution p; therefore p(x^n) ≈ 2^{-nH(X)} for every x^n in the strongly typical set.
3. |T_δ(X)| ≈ 2^{nH(X)}.
4. P(X^n ∈ T_δ(X)) → 1 as n → ∞.

4 δ-jointly typical set

In this section we extend the notion of δ-typical sets to pairs of sequences x^n = (x_1, ..., x_n) and y^n = (y_1, ..., y_n) from alphabets X and Y.

Definition 4. A pair of sequences (x^n, y^n) is said to be δ-jointly typical with respect to a pmf p_{XY} on X × Y if

    |p_{x^n,y^n}(x, y) - p(x, y)| ≤ δ·p(x, y)   for all x ∈ X, y ∈ Y,

where p_{x^n,y^n}(x, y) is the joint empirical distribution. The δ-jointly typical set T_δ(X, Y) is defined accordingly:

    T_δ(X, Y) = T_δ(p_{XY}) := {(x^n, y^n) : |p_{x^n,y^n}(x, y) - p(x, y)| ≤ δ·p(x, y) for all x ∈ X, y ∈ Y}.

If we look carefully, nothing here is really new: we simply require that the empirical distribution of the pair of sequences be δ-close to the pmf p_{XY}. It is easy to see that the size of the jointly typical set satisfies

    |T_δ(X, Y)| ≈ 2^{nH(X,Y)}.
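The following Python sketch (ours, not from the notes) checks Definition 2 on the two strings from Example 3; the helper name and the choice δ = 0.05 are ours:

```python
from collections import Counter

def is_strongly_typical(x, p, delta):
    """Check |p_{x^n}(a) - p(a)| <= delta * p(a) for every letter a in the support of p."""
    n = len(x)
    counts = Counter(x)
    return all(abs(counts.get(a, 0) / n - pa) <= delta * pa for a, pa in p.items())

p = {"a": 0.1, "b": 0.8, "c": 0.1}
x_strong = ["a"] * 100 + ["b"] * 800 + ["c"] * 100   # empirical pmf matches p exactly
x_weak   = ["a"] * 200 + ["b"] * 800                 # same probability under p, wrong letter frequencies

print(is_strongly_typical(x_strong, p, delta=0.05))  # True
print(is_strongly_typical(x_weak,   p, delta=0.05))  # False: p_hat(a) = 0.2 and p_hat(c) = 0 both violate (2)
```

Joint typicality (Definition 4) is the same test applied to the pair sequence ((x_1, y_1), ..., (x_n, y_n)), with p replaced by the joint pmf p_{XY}.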

Figure 2: A useful diagram depicting typical sets, from El Gamal and Kim (Chapter 2): the jointly typical pairs sit inside the product of the marginal typical sets, with |T(X)| ≈ 2^{nH(X)}, |T(Y)| ≈ 2^{nH(Y)}, |T(X, Y)| ≈ 2^{nH(X,Y)}, |T(Y | x^n)| ≈ 2^{nH(Y|X)}, and |T(X | y^n)| ≈ 2^{nH(X|Y)}.

Moreover, each pair (x^n, y^n) ∈ T_δ(X, Y) has probability p(x^n, y^n) ≈ 2^{-nH(X,Y)}.

In Figure 2 we depict the length-n sequences from alphabet X on the x-axis and the sequences from alphabet Y on the y-axis. Any point in this table then corresponds to a pair of sequences, and the jointly typical pairs are marked with dots. It is easy to see that if a pair is jointly typical, then both of its component sequences are typical as well. We will also see in the next section that the number of dots in the column corresponding to a typical x^n sequence is approximately 2^{nH(Y|X)} (a similar statement holds for the rows). We can quickly check that this is consistent by a counting argument: there are about 2^{nH(X)} typical x^n sequences, and each column contains about 2^{nH(Y|X)} jointly typical pairs, so in total there are about 2^{nH(X)} · 2^{nH(Y|X)} typical pairs. But 2^{nH(X)} · 2^{nH(Y|X)} = 2^{n(H(X)+H(Y|X))} = 2^{nH(X,Y)}, which is consistent with the fact that the total number of jointly typical pairs is about 2^{nH(X,Y)}.

5 δ-conditional typicality

In many applications of typicality, we know one sequence (say, the input to a channel) and want to know the typical values of some other sequence. In this case, a useful concept is conditional typicality.

Definition 5. Fix δ > 0 and x^n ∈ X^n. The conditionally typical set T_δ(Y | x^n) is defined by

    T_δ(Y | x^n) := {y^n ∈ Y^n : (x^n, y^n) ∈ T_δ(X, Y)}.

As usual, it is useful to have bounds on the size of this set. For all δ' < δ (and n large enough, as usual),

    x^n ∈ T_{δ'}(X)   ⟹   |T_δ(Y | x^n)| ≈ 2^{nH(Y|X)}

(below we will be a bit more careful, and we will see that the exponent should involve a function ε(δ) which vanishes as δ → 0). Note that this asymptotic behavior does not depend on the specific sequence x^n, as long as x^n is typical. This is all in accordance with the intuition we have developed: all typical sequences behave roughly alike.

If x^n ∉ T_δ(X), then |T_δ(Y | x^n)| = 0, since (x^n, y^n) cannot be jointly typical if x^n is not typical.
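The counting argument above rests on the chain rule H(X) + H(Y|X) = H(X,Y). Here is a small Python sketch (our own, with a made-up joint pmf) that computes the entropies directly and confirms that the exponents match:

```python
import math

# A toy joint pmf p(x, y) on {0, 1} x {0, 1} (made up for illustration).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(dist):
    """Entropy in bits of a pmf given as a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

xs = {x for x, _ in p_xy}
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in xs}
# H(Y|X) = sum_x p(x) H(Y | X = x), computed from the conditional pmfs p(y|x).
H_y_given_x = sum(
    p_x[x] * H({y: p_xy[(x, y)] / p_x[x] for (xx, y) in p_xy if xx == x})
    for x in xs
)

# Chain rule: H(X) + H(Y|X) = H(X,Y), i.e. 2^{nH(X)} * 2^{nH(Y|X)} = 2^{nH(X,Y)}.
print(H(p_x) + H_y_given_x)   # ≈ 1.846 bits
print(H(p_xy))                # ≈ 1.846 bits
```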

To illustrate the methods used to describe the asymptotic behavior of |T_δ(Y | x^n)|, we derive an upper bound on its size. If x^n ∈ T_{δ'}(X) ⊆ T_δ(X) and y^n ∈ T_δ(Y | x^n), then by definition of the conditionally typical set, (x^n, y^n) ∈ T_δ(X, Y). Since strong typicality implies weak typicality,

    (1 - δ) H(X,Y) ≤ -(1/n) log p(x^n, y^n) ≤ (1 + δ) H(X,Y),    (3)
    (1 - δ) H(X) ≤ -(1/n) log p(x^n) ≤ (1 + δ) H(X).    (4)

So, fix some x^n ∈ T_{δ'}(X). Then

    1 ≥ Σ_{y^n ∈ T_δ(Y|x^n)} p(y^n | x^n)
      = Σ_{y^n ∈ T_δ(Y|x^n)} p(x^n, y^n) / p(x^n)
      ≥ |T_δ(Y | x^n)| · 2^{-n(H(X,Y) - H(X) + ε(δ))}
      = |T_δ(Y | x^n)| · 2^{-n(H(Y|X) + ε(δ))},

and therefore

    |T_δ(Y | x^n)| ≤ 2^{n(H(Y|X) + ε(δ))}.

6 Conditional Typicality and Encoding/Decoding Schemes for Sending Messages

We now explain how these concepts relate to sending messages over channels. Consider a discrete memoryless channel (DMC), described by a conditional probability distribution p(y | x), where x is the input to the channel and y is the output. Then

    p(y^n | x^n) = Π_{i=1}^{n} p(y_i | x_i).

Say we want to send one of M messages over the channel. We encode each message m ∈ {1, ..., M} into a codeword X^n(m). We then send the codeword over the channel, obtaining Y^n. Finally, we use a decoding rule M̂(Y^n) which yields M̂ = m with high probability.

The conditional typicality lemma (to be proved in Homework 7) characterizes the behavior of this channel for large n: if we choose a typical input, then the output is essentially chosen uniformly at random from T_δ(Y | x^n). More precisely, for all δ' < δ, x^n ∈ T_{δ'}(X) implies that

    P(Y^n ∈ T_δ(Y | x^n)) = P((x^n, Y^n) ∈ T_δ(X, Y)) → 1   as n → ∞;

furthermore, the probability of obtaining each particular element of T_δ(Y | x^n) is essentially 1 / |T_δ(Y | x^n)| ≈ 2^{-nH(Y|X)}.

This lemma limits the values of M for which there is an encoding/decoding scheme that can succeed with high probability. As illustrated in Figure 3, what we are doing is choosing a typical codeword x^n ∈ T_δ(X) and receiving some element Y^n ∈ T_δ(Y | x^n). We can think of this latter set as a "noise ball": it is the set of outputs y^n that we could typically expect to receive, given that our input is x^n. If the noise balls corresponding to different inputs overlap significantly, then we have no hope of recovering m from Y^n with high probability, as multiple inputs give indistinguishable outputs. Since, for any input, the output will with high probability be typical, i.e. Y^n ∈ T_δ(Y), the number of messages we can send is limited by the number of noise balls we can fit inside T_δ(Y). Since |T_δ(Y)| ≈ 2^{nH(Y)} and |T_δ(Y | x^n)| ≈ 2^{nH(Y|X)}, it follows that the number of messages we can send over this channel is at most

    2^{nH(Y)} / 2^{nH(Y|X)} = 2^{n(H(Y) - H(Y|X))} = 2^{nI(X;Y)} ≤ 2^{nC},

where C is the channel capacity. Note that this argument does not give a construction that lets us attain this upper bound on the communication rate; the magic of the direct part of Shannon's channel coding theorem is that random coding does attain it.
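As a numerical illustration of this counting bound (our own sketch, not from the notes), consider a binary symmetric channel with crossover probability 0.1 and a uniform input; the script below computes H(Y), H(Y|X), and the resulting exponent I(X;Y), which for this input distribution also equals the capacity C = 1 - H_b(0.1):

```python
import math

def h_b(p):
    """Binary entropy function, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

eps = 0.1     # BSC crossover probability (assumed for this example)
n = 1000      # block length

# With a uniform input, the output Y is also uniform, so H(Y) = 1 and H(Y|X) = h_b(eps).
H_Y = 1.0
H_Y_given_X = h_b(eps)
I_XY = H_Y - H_Y_given_X   # = 1 - h_b(0.1) ≈ 0.531 bits per channel use (the BSC capacity)

# Noise-ball count: at most 2^{nH(Y)} / 2^{nH(Y|X)} = 2^{nI(X;Y)} distinguishable messages.
print(f"I(X;Y) ≈ {I_XY:.3f} bits/use; at most about 2^{n * I_XY:.0f} messages at block length n = {n}.")
```

Of course, as noted above, this only bounds the number of messages; approaching a rate of I(X;Y) requires the random-coding argument of the channel coding theorem.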

Figure 3: The typical sets T_δ(X), T_δ(Y), and T_δ(Y | x^n): within the space of all length-n sequences, two typical codewords x^n(1) and x^n(2) are shown together with their noise balls T_δ(Y | x^n(1)) and T_δ(Y | x^n(2)) inside the typical set of Y.

7 Joint Typicality Lemma

In this final section, we discuss the joint typicality lemma, which tells us how well we can guess the output without knowing the input. Intuitively, if X and Y are strongly correlated, then we might expect that not knowing the input strongly impairs our ability to guess the output, whereas if X and Y are independent, then not knowing the input should not impair our ability to guess the output at all.

So, say that the actual input is x^n. We will look for a bound on the probability that an output lands in the conditionally typical set T_δ(Y | x^n), that is, the probability that we would guess x^n to be the input, in terms of the mutual information I(X;Y). Fix any δ > 0 and δ' < δ, and fix x^n ∈ T_{δ'}(X). Choose Ỹ^n ∈ Y^n by drawing each Ỹ_i i.i.d. according to the marginal distribution p(y). (So, intuitively, we have forgotten what we sent as input to the channel and are simulating the output.)
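Before carrying out the calculation, here is a rough Monte Carlo sketch of the quantity we are about to bound (entirely our own illustration: the joint pmf, block length, δ, and trial count are all made up, and at this small block length the typicality test is crude). It fixes a strongly typical x^n, draws Ỹ^n i.i.d. from the marginal p(y), and estimates P((x^n, Ỹ^n) ∈ T_δ(X, Y)), to be compared against 2^{-nI(X;Y)} (the lemma's actual upper bound is the slightly larger 2^{-n(I(X;Y)-ε(δ))}):

```python
import math, random
from collections import Counter

random.seed(1)

# Assumed joint pmf on {0, 1} x {0, 1}; X and Y are positively correlated.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.5, 1: 0.5}
I_xy = sum(p * math.log2(p / (p_x[a] * p_y[b])) for (a, b), p in p_xy.items())

n, delta, trials = 30, 0.3, 200_000
x = [0] * (n // 2) + [1] * (n // 2)   # a strongly typical x^n: its empirical pmf equals p_x exactly

def jointly_typical(xs, ys):
    counts = Counter(zip(xs, ys))
    return all(abs(counts.get(k, 0) / n - p) <= delta * p for k, p in p_xy.items())

# Draw Y~^n i.i.d. from the marginal p(y), ignoring x^n, and count joint-typicality hits.
hits = sum(jointly_typical(x, random.choices((0, 1), k=n)) for _ in range(trials))

print(f"estimated P((x^n, Y~^n) in T_delta(X, Y)) = {hits / trials:.1e}")
print(f"benchmark 2^(-n I(X;Y))                  = {2 ** (-n * I_xy):.1e}")
```

With such a short block length the estimate is noisy, but it should already fall below the 2^{-nI(X;Y)} benchmark; the derivation below makes the bound precise.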

Then, noting that every y^n ∈ T_δ(Y | x^n) is also in T_δ(Y), so that

    p(y^n) ≤ 2^{-n(H(Y) - ε(δ))},

where ε(δ) is a function that approaches 0 as δ → 0, we have

    P(Ỹ^n ∈ T_δ(Y | x^n)) = P((x^n, Ỹ^n) ∈ T_δ(X, Y))
      = Σ_{y^n ∈ T_δ(Y|x^n)} p(y^n)
      ≤ |T_δ(Y | x^n)| · 2^{-n(H(Y) - ε(δ))}
      ≤ 2^{n(H(Y|X) + ε(δ))} · 2^{-n(H(Y) - ε(δ))}
      = 2^{-n(H(Y) - H(Y|X) - ε(δ))}
      = 2^{-n(I(X;Y) - ε(δ))},

where ε(δ) → 0 as δ → 0 (and, as usual, we let ε(δ) absorb constant factors from line to line).

Intuitive argument for the joint typicality lemma. The joint typicality lemma asserts that the probability that two independently drawn sequences x^n and y^n are jointly typical is roughly 2^{-nI(X;Y)}. Observe that there are roughly 2^{nH(X)} typical x^n sequences and roughly 2^{nH(Y)} typical y^n sequences, while the total number of jointly typical pairs is about 2^{nH(X,Y)}. Thus the probability that two randomly chosen typical sequences happen to be jointly typical is about

    2^{nH(X,Y)} / (2^{nH(X)} · 2^{nH(Y)}) = 2^{-nI(X;Y)}.    (5)

8 Next time

Next lecture will be on lossy compression.