2. Information measures: mutual information

2.1 Divergence: main inequality

Theorem 2.1 (Information Inequality). D(P ‖ Q) ≥ 0; D(P ‖ Q) = 0 iff P = Q.

Proof. Let φ(x) ≜ x log x, which is strictly convex, and use Jensen's inequality:

    D(P ‖ Q) = Σ_x P(x) log [P(x)/Q(x)] = Σ_x Q(x) φ(P(x)/Q(x)) ≥ φ(Σ_x Q(x) · P(x)/Q(x)) = φ(1) = 0.

2.2 Conditional divergence

The main objects in our course are random variables. The main operation for creating new random variables, and also for defining relations between random variables, is that of a random transformation:

Definition 2.1. A conditional probability distribution (aka random transformation, transition probability kernel, Markov kernel, channel) K(·|·) has two arguments: the first argument is a measurable subset of 𝒴, the second argument is an element of 𝒳. It must satisfy:

1. For any x ∈ 𝒳: K(·|x) is a probability measure on 𝒴.
2. For any measurable A: the function x ↦ K(A|x) is measurable on 𝒳.

In this case we say that K acts from 𝒳 to 𝒴. In fact, we will abuse notation and write P_{Y|X} instead of K to suggest what the spaces 𝒳 and 𝒴 are.¹ Furthermore, if X and Y are connected by the random transformation P_{Y|X}, we will write X → P_{Y|X} → Y.

Remark 2.1. (Very technical!) Unfortunately, condition 2 (standard for probability textbooks) will frequently not be sufficiently strong for this course. The main reason is that we want Radon–Nikodym derivatives such as (dP_{Y|X=x}/dQ_Y)(y) to be jointly measurable in (x, y). See Section 2.6* for more.

Examples:

1. deterministic system: Y = f(X) ⇔ P_{Y|X=x} = δ_{f(x)}
2. decoupled system: Y ⊥ X ⇔ P_{Y|X=x} = P_Y

¹ Another reason for writing P_{Y|X} is that from any joint distribution P_{X,Y} (on standard Borel spaces) one can extract a random transformation by conditioning on X.
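The information inequality is easy to sanity-check numerically for finite alphabets. Below is a minimal Python sketch (the helper `kl_divergence` is ours, not from the notes); distributions are lists of probabilities and the divergence is in nats:

```python
import math

def kl_divergence(p, q):
    """D(P||Q) = sum_x P(x) log(P(x)/Q(x)) in nats; requires P << Q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # P not absolutely continuous w.r.t. Q
            total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
assert kl_divergence(p, q) >= 0   # information inequality
assert kl_divergence(p, p) == 0   # D(P||P) = 0
```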
3. additive noise (convolution): Y = X + Z with Z ⊥ X ⇔ P_{Y|X=x} = P_{x+Z}.

Multiplication: combine P_X with a kernel P_{Y|X} to get the joint P_{X,Y} = P_X P_{Y|X}:

    P_{X,Y}(x, y) = P_{Y|X}(y|x) P_X(x).

Composition (marginalization): P_Y = P_{Y|X} ∘ P_X, that is, P_{Y|X} acts on P_X to produce P_Y:

    P_Y(y) = Σ_{x∈𝒳} P_{Y|X}(y|x) P_X(x).

We will also write P_X → P_{Y|X} → P_Y.

Definition 2.2 (Conditional divergence).

    D(P_{Y|X} ‖ Q_{Y|X} | P_X) = E_{x∼P_X}[D(P_{Y|X=x} ‖ Q_{Y|X=x})]           (2.1)
                               = Σ_{x∈𝒳} P_X(x) D(P_{Y|X=x} ‖ Q_{Y|X=x}).     (2.2)

Note: H(X|Y) = log|𝒜| − D(P_{X|Y} ‖ U_X | P_Y), where U_X is the uniform distribution on 𝒳.

Theorem 2.2 (Properties of Divergence).

1. D(P_{Y|X} ‖ Q_{Y|X} | P_X) = D(P_X P_{Y|X} ‖ P_X Q_{Y|X})
2. (Simple chain rule) D(P_{X,Y} ‖ Q_{X,Y}) = D(P_{Y|X} ‖ Q_{Y|X} | P_X) + D(P_X ‖ Q_X)
3. (Monotonicity) D(P_{X,Y} ‖ Q_{X,Y}) ≥ D(P_Y ‖ Q_Y)
4. (Full chain rule)

    D(P_{X_1⋯X_n} ‖ Q_{X_1⋯X_n}) = Σ_{i=1}^n D(P_{X_i|X^{i−1}} ‖ Q_{X_i|X^{i−1}} | P_{X^{i−1}}).

   In the special case of Q_{X^n} = Π_i Q_{X_i} we have

    D(P_{X_1⋯X_n} ‖ Q_{X_1} ⋯ Q_{X_n}) = D(P_{X_1⋯X_n} ‖ P_{X_1} ⋯ P_{X_n}) + Σ_i D(P_{X_i} ‖ Q_{X_i}).

5. (Conditioning increases divergence) Let P_{Y|X} and Q_{Y|X} be two kernels, and let P_Y = P_{Y|X} ∘ P_X and Q_Y = Q_{Y|X} ∘ P_X. Then

    D(P_Y ‖ Q_Y) ≤ D(P_{Y|X} ‖ Q_{Y|X} | P_X),

   with equality iff D(P_{X|Y} ‖ Q_{X|Y} | P_Y) = 0. Pictorially: the same input P_X fed into the two kernels P_{Y|X} and Q_{Y|X} produces the outputs P_Y and Q_Y.
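The simple chain rule (property 2) can be verified on a toy finite-alphabet example; the distributions and kernels below are arbitrary illustrative choices, and `kl` is our helper (divergence in nats):

```python
import math

def kl(p, q):
    """D(P||Q) in nats for strictly positive finite pmfs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Joint distributions built as P_X times a kernel P_{Y|X} (rows of a matrix).
p_x = [0.6, 0.4]
p_ygx = [[0.7, 0.3], [0.2, 0.8]]
q_x = [0.5, 0.5]
q_ygx = [[0.5, 0.5], [0.4, 0.6]]

p_xy = [p_x[i] * p_ygx[i][j] for i in range(2) for j in range(2)]
q_xy = [q_x[i] * q_ygx[i][j] for i in range(2) for j in range(2)]

# D(P_{Y|X} || Q_{Y|X} | P_X) per (2.2), then the simple chain rule.
cond_div = sum(p_x[i] * kl(p_ygx[i], q_ygx[i]) for i in range(2))
lhs = kl(p_xy, q_xy)
rhs = cond_div + kl(p_x, q_x)
assert abs(lhs - rhs) < 1e-12
```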
6. (Data-processing for divergences) Let

    P_Y(·) = ∫ P_{Y|X}(·|x) dP_X,    Q_Y(·) = ∫ P_{Y|X}(·|x) dQ_X.

Then

    D(P_Y ‖ Q_Y) ≤ D(P_X ‖ Q_X).    (2.3)

Pictorially: pushing both P_X and Q_X through the same channel P_{Y|X} yields outputs P_Y and Q_Y with D(P_Y ‖ Q_Y) ≤ D(P_X ‖ Q_X).

Proof. We only illustrate these results for the case of finite alphabets. The general case follows by a careful analysis of Radon–Nikodym derivatives, introduction of regular branches of conditional probability, etc. For certain cases (e.g. separable metric spaces), however, we can simply discretize the alphabets and take the granularity of discretization to 0. This method will become clearer in Lecture 4, once we understand continuity of D.

1. E_{x∼P_X}[D(P_{Y|X=x} ‖ Q_{Y|X=x})] = E_{(X,Y)∼P_X P_{Y|X}}[log (P_{Y|X}/Q_{Y|X})].
2. Disintegration: E_{(X,Y)}[log (P_{X,Y}/Q_{X,Y})] = E_{(X,Y)}[log (P_{Y|X}/Q_{Y|X}) + log (P_X/Q_X)].
3. Apply 2 with X and Y interchanged and use D ≥ 0.
4. Telescoping: P_{X^n} = Π_{i=1}^n P_{X_i|X^{i−1}} and Q_{X^n} = Π_{i=1}^n Q_{X_i|X^{i−1}}.
5. The inequality follows from monotonicity. To get the conditions for equality, notice that by the chain rule for D:

    D(P_{X,Y} ‖ Q_{X,Y}) = D(P_{Y|X} ‖ Q_{Y|X} | P_X) + D(P_X ‖ P_X) = D(P_{X|Y} ‖ Q_{X|Y} | P_Y) + D(P_Y ‖ Q_Y),

   where D(P_X ‖ P_X) = 0, and hence we get the claimed result from positivity of D.
6. This again follows from monotonicity.

Corollary 2.1. D(P_{X_1⋯X_n} ‖ Q_{X_1} ⋯ Q_{X_n}) ≥ Σ_i D(P_{X_i} ‖ Q_{X_i}), with equality iff P_{X^n} = Π_j P_{X_j}.

Note: In general we can have D(P_{X,Y} ‖ Q_{X,Y}) ≷ D(P_X ‖ Q_X) + D(P_Y ‖ Q_Y). For example, if X = Y under P and Q, then D(P_{X,Y} ‖ Q_{X,Y}) = D(P_X ‖ Q_X) < 2 D(P_X ‖ Q_X). Conversely, if P_X = Q_X and P_Y = Q_Y but P_{X,Y} ≠ Q_{X,Y}, we have D(P_{X,Y} ‖ Q_{X,Y}) > 0 = D(P_X ‖ Q_X) + D(P_Y ‖ Q_Y).

Corollary 2.2. Y = f(X) ⇒ D(P_Y ‖ Q_Y) ≤ D(P_X ‖ Q_X), with equality if f is 1-1.
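The data-processing inequality (2.3) can likewise be checked by pushing two inputs through one fixed kernel; the 2x2 channel and input pair below are arbitrary illustrations, and `kl` / `push` are our helpers:

```python
import math

def kl(p, q):
    """D(P||Q) in nats for finite pmfs with P << Q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def push(p, channel):
    """Marginalize: P_Y(y) = sum_x P_{Y|X}(y|x) P_X(x); channel rows are P_{Y|X=x}."""
    return [sum(p[x] * channel[x][y] for x in range(len(p)))
            for y in range(len(channel[0]))]

channel = [[0.9, 0.1], [0.3, 0.7]]   # one fixed kernel P_{Y|X}
p_x, q_x = [0.8, 0.2], [0.4, 0.6]
p_y, q_y = push(p_x, channel), push(q_x, channel)
assert kl(p_y, q_y) <= kl(p_x, q_x)  # D(P_Y||Q_Y) <= D(P_X||Q_X)
```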
Note: D(P_Y ‖ Q_Y) = D(P_X ‖ Q_X) does not imply that f is 1-1. Example: P_X = Gaussian, Q_X = Laplace, Y = |X|.

Corollary 2.3 (Large deviations estimate). For any subset E ⊆ 𝒳 we have

    d(P_X[E] ‖ Q_X[E]) ≤ D(P_X ‖ Q_X).

Proof. Consider Y = 1{X ∈ E}.

2.3 Mutual information

Definition 2.3 (Mutual information).

    I(X;Y) = D(P_{X,Y} ‖ P_X P_Y).

Notes:

- Intuition: I(X;Y) measures the dependence between X and Y, or, the information about X (resp. Y) provided by Y (resp. X).
- Defined by Shannon (in a different form), in this form by Fano.
- X, Y are not restricted to be discrete.
- I(X;Y) is a functional of the joint distribution P_{X,Y}, or equivalently, of the pair (P_X, P_{Y|X}).

Theorem 2.3 (Properties of I).

1. I(X;Y) = D(P_{X,Y} ‖ P_X P_Y) = D(P_{Y|X} ‖ P_Y | P_X) = D(P_{X|Y} ‖ P_X | P_Y)
2. Symmetry: I(X;Y) = I(Y;X)
3. Positivity: I(X;Y) ≥ 0; I(X;Y) = 0 iff X ⊥ Y
4. I(f(X);Y) ≤ I(X;Y); f one-to-one ⇒ I(f(X);Y) = I(X;Y)
5. "More data ⇒ more info": I(X_1, X_2; Z) ≥ I(X_1; Z)

Proof.

1. I(X;Y) = E log [P_{X,Y}/(P_X P_Y)] = E log [P_{Y|X}/P_Y] = E log [P_{X|Y}/P_X].
2. Apply the data-processing inequality twice to the map (x, y) ↦ (y, x) to get D(P_{X,Y} ‖ P_X P_Y) = D(P_{Y,X} ‖ P_Y P_X).
3. By definition.
4. We will use the data-processing property of mutual information (to be proved shortly, see Theorem 2.5). Consider the chain of data processing: (x, y) ↦ (f(x), y) ↦ (f^{−1}(f(x)), y). Then I(X;Y) ≥ I(f(X);Y) ≥ I(f^{−1}(f(X));Y) = I(X;Y).
5. Consider f(X_1, X_2) = X_1.

Theorem 2.4 (I vs. H).
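For finite alphabets, Definition 2.3 translates directly into a few lines of code; `mutual_information` below is our helper and the joint pmfs are arbitrary examples:

```python
import math

def mutual_information(joint):
    """I(X;Y) = D(P_{X,Y} || P_X P_Y) in nats, joint pmf given as a matrix."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (p_x[i] * p_y[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

joint = [[0.3, 0.1], [0.2, 0.4]]
transposed = [list(col) for col in zip(*joint)]
assert mutual_information(joint) >= 0                # positivity
assert abs(mutual_information(joint)
           - mutual_information(transposed)) < 1e-12  # symmetry

indep = [[0.3 * 0.6, 0.3 * 0.4], [0.7 * 0.6, 0.7 * 0.4]]  # product pmf
assert abs(mutual_information(indep)) < 1e-12             # I = 0 iff independent
```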
1. I(X;X) = H(X) if X is discrete, and I(X;X) = +∞ otherwise.
2. If X, Y are discrete then I(X;Y) = H(X) + H(Y) − H(X,Y). If only X is discrete then I(X;Y) = H(X) − H(X|Y).
3. If X, Y are real-valued vectors, have a joint pdf and all three differential entropies are finite, then I(X;Y) = h(X) + h(Y) − h(X,Y). If X has marginal pdf p_X and conditional pdf p_{X|Y}(x|y), then I(X;Y) = h(X) − h(X|Y).
4. If X or Y is discrete then I(X;Y) ≤ min(H(X), H(Y)), with equality iff H(X|Y) = 0 or H(Y|X) = 0, i.e., one is a deterministic function of the other.

Proof.

1. By definition, I(X;X) = D(P_{X,X} ‖ P_X P_X) = E_{x∼P_X} D(δ_x ‖ P_X). If P_X is discrete, then D(δ_x ‖ P_X) = log 1/P_X(x) and I(X;X) = H(X). If P_X is not discrete, let A = {x : P_X(x) > 0} denote the set of atoms of P_X, and let Δ = {(x, x) : x ∉ A} ⊆ 𝒳 × 𝒳. Then P_{X,X}(Δ) = P_X(A^c) > 0, but since

    (P_X × P_X)(E) ≜ ∫∫ P_X(dx_1) P_X(dx_2) 1{(x_1, x_2) ∈ E},

   we have, by taking E = Δ, that (P_X × P_X)(Δ) = 0. Thus P_{X,X} is not absolutely continuous w.r.t. P_X × P_X, and hence I(X;X) = D(P_{X,X} ‖ P_X P_X) = ∞.
2. I(X;Y) = E log [P_{X,Y}/(P_X P_Y)] = E[log 1/P_X + log 1/P_Y − log 1/P_{X,Y}] = H(X) + H(Y) − H(X,Y).
3. Similarly, when P_{X,Y}, P_X and P_Y have densities p_{X,Y}, p_X and p_Y, we have

    D(P_{X,Y} ‖ P_X P_Y) = E[log (p_{X,Y}/(p_X p_Y))] = h(X) + h(Y) − h(X,Y).

4. Follows from 2.

Corollary 2.4 (Conditioning reduces entropy). For X discrete: H(X|Y) ≤ H(X), with equality iff X ⊥ Y.

Intuition: the amount of entropy reduction = mutual information.

Example: X = U OR Y, where U, Y ∼ Bern(1/2) i.i.d. Then X ∼ Bern(3/4) and H(X) = h(1/4) < 1 bit = H(X|Y = 0), i.e., conditioning on Y = 0 increases entropy. But on average, H(X|Y) = P[Y = 0] H(X|Y = 0) + P[Y = 1] H(X|Y = 1) = 1/2 bit < H(X), by the strong concavity of h(·).

Note: Information, entropy and Venn diagrams:

1. The following Venn diagram illustrates the relationship between entropy, conditional entropy, joint entropy, and mutual information.
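The identity I(X;Y) = H(X) + H(Y) − H(X,Y) and the OR example can be checked numerically (entropies in bits; the joint pmf is hard-coded from the example, and `H` is our helper):

```python
import math

def H(p):
    """Shannon entropy of a pmf, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# X = U OR Y with U, Y i.i.d. Bern(1/2); joint pmf over (x, y).
joint = {(0, 0): 0.25, (1, 0): 0.25, (1, 1): 0.5}
p_x = [0.25, 0.75]
p_y = [0.5, 0.5]

h_xy = H(list(joint.values()))
i_xy = H(p_x) + H(p_y) - h_xy          # I(X;Y) = H(X) + H(Y) - H(X,Y)
h_x_given_y = h_xy - H(p_y)            # H(X|Y) = H(X,Y) - H(Y)

assert abs(h_x_given_y - 0.5) < 1e-12  # averages to 1/2 bit, as in the example
assert h_x_given_y <= H(p_x)           # conditioning reduces entropy on average
assert abs(i_xy - (H(p_x) - h_x_given_y)) < 1e-12
```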
[Venn diagram: the two circles H(X) and H(Y) overlap in I(X;Y); the remaining regions are H(X|Y) and H(Y|X), and the union is H(X,Y).]

2. If you do the same for 3 variables, you will discover that the triple intersection corresponds to

    H(X_1) + H(X_2) + H(X_3) − H(X_1,X_2) − H(X_2,X_3) − H(X_1,X_3) + H(X_1,X_2,X_3),    (2.4)

which is sometimes denoted I(X;Y;Z). It can be both positive and negative (why?).

3. In general, one can treat random variables as sets (so that r.v. X_i corresponds to set E_i and (X_1,X_2) corresponds to E_1 ∪ E_2). Then we can define a unique signed measure µ on the finite algebra generated by these sets so that every information quantity is found by replacing I/H → µ; , → ∪; ; → ∩; | → ∖. As an example, we have

    H(X_1|X_2,X_3) = µ(E_1 ∖ (E_2 ∪ E_3)),                  (2.5)
    I(X_1,X_2;X_3|X_4) = µ(((E_1 ∪ E_2) ∩ E_3) ∖ E_4).      (2.6)

By inclusion–exclusion, quantity (2.4) corresponds to µ(E_1 ∩ E_2 ∩ E_3), which explains why µ is not necessarily a positive measure.

Example: Bivariate Gaussian. For X, Y jointly Gaussian,

    I(X;Y) = (1/2) log [1/(1 − ρ_{XY}²)],

where ρ_{XY} ≜ E[(X − EX)(Y − EY)]/(σ_X σ_Y) ∈ [−1, 1] is the correlation coefficient.

Proof. WLOG, by shifting and scaling if necessary, we can assume EX = EY = 0 and EX² = EY² = 1. Then ρ = E[XY]. By joint Gaussianity, Y = ρX + Z for some Z ∼ N(0, 1 − ρ²) ⊥ X. Then using the divergence formula for Gaussians (1.16), we get

    I(X;Y) = D(P_{Y|X} ‖ P_Y | P_X) = E_X D(N(ρX, 1 − ρ²) ‖ N(0, 1))
           = E[(1/2) log (1/(1 − ρ²)) + (log e/2)((ρX)² + 1 − ρ² − 1)]
           = (1/2) log [1/(1 − ρ²)].
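The bivariate-Gaussian formula can be spot-checked by Monte Carlo, estimating I(X;Y) = E log(p_{Y|X}/p_Y) under the representation Y = ρX + Z used in the proof. This is a rough sketch: the sample size, seed, and tolerance below are ad-hoc choices of ours.

```python
import math
import random

def mi_gauss(rho):
    """I(X;Y) = (1/2) log 1/(1 - rho^2), in nats."""
    return 0.5 * math.log(1.0 / (1.0 - rho * rho))

random.seed(0)
rho, n = 0.8, 200_000
s2 = 1.0 - rho * rho          # conditional variance of Y given X
acc = 0.0
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    y = rho * x + random.gauss(0.0, math.sqrt(s2))
    # log p_{Y|X}(y|x) with Y|X=x ~ N(rho x, s2), and log p_Y(y) with Y ~ N(0,1)
    log_cond = -0.5 * math.log(2 * math.pi * s2) - (y - rho * x) ** 2 / (2 * s2)
    log_marg = -0.5 * math.log(2 * math.pi) - y * y / 2
    acc += log_cond - log_marg
estimate = acc / n
assert abs(estimate - mi_gauss(rho)) < 0.03  # within Monte Carlo error
```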
Note: Similar to the role of mutual information, the correlation coefficient also measures, in a certain sense, the dependency between random variables which are real-valued (more generally, valued in an inner-product space). However, mutual information is invariant to bijections and more general: it can be defined not just for numerical random variables, but also for apples and oranges.

Example: Additive white Gaussian noise (AWGN) channel. Let X ⊥ N be independent Gaussians and Y = X + N. Then

    I(X; X + N) = (1/2) log (1 + σ_X²/σ_N²),

where σ_X²/σ_N² is the signal-to-noise ratio (SNR).

Example: Gaussian vectors. For X ∈ R^m, Y ∈ R^n jointly Gaussian,

    I(X;Y) = (1/2) log [det Σ_X det Σ_Y / det Σ_{[X,Y]}],

where Σ_X ≜ E[(X − EX)(X − EX)^T] denotes the covariance matrix of X ∈ R^m, and Σ_{[X,Y]} denotes the covariance matrix of the random vector [X,Y] ∈ R^{m+n}. In the special case of additive noise, Y = X + N for N ⊥ X, we have

    I(X; X + N) = (1/2) log [det(Σ_X + Σ_N) / det Σ_N],

since (why?) det Σ_{[X,X+N]} = det [Σ_X, Σ_X; Σ_X, Σ_X + Σ_N] = det Σ_X det Σ_N.

Example: Binary symmetric channel (BSC). Let X ∼ Bern(1/2), N ∼ Bern(δ), and Y = X ⊕ N. Then

    I(X;Y) = log 2 − h(δ).

Example: Addition over finite groups. If X is uniform on G and independent of Z, then

    I(X; X + Z) = log |G| − H(Z).

Proof. Show that X + Z is uniform on G regardless of Z.
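The BSC example can be confirmed by computing I(X;Y) directly from the joint pmf (everything in nats; `h2` and `mi` are our helpers, and δ = 0.11 is an arbitrary crossover probability):

```python
import math

def h2(d):
    """Binary entropy h(d) in nats."""
    return -d * math.log(d) - (1 - d) * math.log(1 - d)

def mi(joint):
    """I(X;Y) in nats from a joint pmf matrix."""
    px = [sum(r) for r in joint]
    py = [sum(c) for c in zip(*joint)]
    return sum(p * math.log(p / (px[i] * py[j]))
               for i, r in enumerate(joint) for j, p in enumerate(r) if p > 0)

delta = 0.11
# X ~ Bern(1/2), Y = X xor N with N ~ Bern(delta): P(x,y) = (1/2) BSC transition
joint = [[0.5 * (1 - delta), 0.5 * delta],
         [0.5 * delta, 0.5 * (1 - delta)]]
assert abs(mi(joint) - (math.log(2) - h2(delta))) < 1e-12
```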
2.4 Conditional mutual information and conditional independence

Definition 2.4 (Conditional mutual information).

    I(X;Y|Z) = D(P_{X,Y|Z} ‖ P_{X|Z} P_{Y|Z} | P_Z)    (2.7)
             = E_{z∼P_Z}[I(X;Y|Z = z)],                (2.8)

where the product of two random transformations is (P_{X|Z=z} P_{Y|Z=z})(x, y) ≜ P_{X|Z}(x|z) P_{Y|Z}(y|z), under which X and Y are independent conditioned on Z.

Note: I(X;Y|Z) is a functional of P_{X,Y,Z}.

Remark 2.2 (Conditional independence). A family of distributions can be represented by a directed acyclic graph. A simple example is a Markov chain (line graph), which represents distributions that factor as P_{X,Y,Z} = P_X P_{Y|X} P_{Z|Y}:

    X → Y → Z  ⇔  P_{X,Y,Z} = P_X P_{Y|X} P_{Z|Y}
               ⇔  P_{Z|X,Y} = P_{Z|Y}
               ⇔  X, Y, Z form a Markov chain (conditional independence of X and Z given Y); notation: X → Y → Z.

Theorem 2.5 (Further properties of Mutual Information).

1. I(X;Z|Y) ≥ 0, with equality iff X → Y → Z.
2. (Kolmogorov identity or small chain rule)

    I(X,Y;Z) = I(X;Z) + I(Y;Z|X) = I(Y;Z) + I(X;Z|Y).

3. (Data Processing) If X → Y → Z, then
   a) I(X;Z) ≤ I(X;Y);
   b) I(X;Y|Z) ≤ I(X;Y).
4. (Full chain rule) I(X^n;Y) = Σ_{k=1}^n I(X_k;Y|X^{k−1}).

Proof.

1. By definition and Theorem 2.1.
2. Disintegrate the log-ratio:

    E log [P_{X,Y,Z}/(P_{X,Y} P_Z)] = E log [P_{X,Z}/(P_X P_Z)] + E log [P_{X,Y,Z} P_X/(P_{X,Y} P_{X,Z})],

   where the two terms on the right are I(X;Z) and I(Y;Z|X), respectively.
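The Kolmogorov identity holds for every joint distribution, which makes it easy to test on an arbitrary pmf over three bits. The weights below are an arbitrary illustration, and `I` is our helper computing (conditional) mutual information by brute force:

```python
import math
from itertools import product

# A generic joint pmf over (x, y, z) in {0,1}^3 (weights sum to 1).
p = {xyz: w for xyz, w in zip(product([0, 1], repeat=3),
                              [0.10, 0.05, 0.15, 0.10, 0.20, 0.05, 0.05, 0.30])}

def marg(keep):
    """Marginal pmf over the coordinates listed in `keep` (in that order)."""
    out = {}
    for xyz, w in p.items():
        key = tuple(xyz[i] for i in keep)
        out[key] = out.get(key, 0.0) + w
    return out

def I(a, b, cond=()):
    """I(A;B|C) in nats; a, b, cond are tuples of coordinate indices."""
    pabc, pac = marg(a + b + cond), marg(a + cond)
    pbc, pc = marg(b + cond), marg(cond)
    total = 0.0
    for xyz, w in p.items():
        ka = tuple(xyz[i] for i in a)
        kb = tuple(xyz[i] for i in b)
        kc = tuple(xyz[i] for i in cond)
        total += w * math.log(pabc[ka + kb + kc] * pc[kc]
                              / (pac[ka + kc] * pbc[kb + kc]))
    return total

# Kolmogorov identity: I(X,Y;Z) = I(X;Z) + I(Y;Z|X) = I(Y;Z) + I(X;Z|Y)
lhs = I((0, 1), (2,))
assert abs(lhs - (I((0,), (2,)) + I((1,), (2,), (0,)))) < 1e-12
assert abs(lhs - (I((1,), (2,)) + I((0,), (2,), (1,)))) < 1e-12
```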
3. Apply the Kolmogorov identity to I(Y,Z;X):

    I(Y,Z;X) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z),

   where I(X;Z|Y) = 0 since X → Y → Z. Hence I(X;Y) = I(X;Z) + I(X;Y|Z), which gives both a) and b).
4. Recursive application of the Kolmogorov identity.

Example: For a 1-to-1 function f: I(X;Y) = I(X;f(Y)).

Note: In general, I(X;Y|Z) ≷ I(X;Y). Examples:

a) ">": Conditioning does not always decrease M.I. To find counterexamples when X, Y, Z do not form a Markov chain, notice that there is only one directed acyclic graph non-isomorphic to X → Y → Z, namely X → Y ← Z. Then a counterexample is: X, Z ∼ Bern(1/2) i.i.d. and Y = X ⊕ Z, so that

    I(X;Y|Z) = I(X; X ⊕ Z | Z) = H(X) = log 2,
    I(X;Y) = 0, since X ⊥ Y.

b) "<": Take Z = Y. Then I(X;Y|Y) = 0 ≤ I(X;Y).

Note: (Chain rule for I ⇒ Chain rule for H) Set Y = X^n. Then

    H(X^n) = I(X^n;X^n) = Σ_{k=1}^n I(X_k;X^n|X^{k−1}) = Σ_{k=1}^n H(X_k|X^{k−1}),

since I(X_k;X^n|X^{k−1}) = H(X_k|X^{k−1}) − H(X_k|X^n,X^{k−1}) and H(X_k|X^n,X^{k−1}) = 0.

Remark 2.3 (Data processing for mutual information via data processing of divergence). We proved data processing for mutual information in Theorem 2.5 using Kolmogorov's identity. In fact, data processing for mutual information is implied by data processing for divergence:

    I(X;Z) = D(P_{Z|X} ‖ P_Z | P_X) ≤ D(P_{Y|X} ‖ P_Y | P_X) = I(X;Y),

where we note that for each x we have P_{Y|X=x} → P_{Z|Y} → P_{Z|X=x} and P_Y → P_{Z|Y} → P_Z. Therefore, if we have a bivariate functional of distributions D(P ‖ Q) which satisfies data processing, then we can define an M.I.-like quantity via

    I_D(X;Y) ≜ D(P_{Y|X} ‖ P_Y | P_X) ≜ E_{x∼P_X} D(P_{Y|X=x} ‖ P_Y),

which will satisfy data processing on Markov chains. A rich class of examples arises by taking D = D_f (an f-divergence, defined in (1.15)). That f-divergences satisfy data processing is going to be shown in Remark 4.2.

2.5 Strong data-processing inequalities

For many random transformations P_{Y|X}, it is possible to improve the data-processing inequality (2.3): for any P_X, Q_X we have

    D(P_Y ‖ Q_Y) ≤ η_KL D(P_X ‖ Q_X),

where η_KL < 1 and depends on the channel P_{Y|X} only. Similarly, this gives an improvement in the data-processing inequality for mutual information: for any P_{U,X} with U → X → Y we have

    I(U;Y) ≤ η_KL I(U;X).
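Counterexample a) can be verified exactly: with X, Z i.i.d. Bern(1/2) and Y = X ⊕ Z, conditioning raises the mutual information from 0 to log 2. The probabilities below are hard-coded from the construction (everything is uniform), so the check is exact:

```python
import math

# X, Z i.i.d. Bern(1/2), Y = X xor Z: the DAG is X -> Y <- Z, not a Markov chain.
p = {(x, x ^ z, z): 0.25 for x in (0, 1) for z in (0, 1)}

def mi_xy():
    """I(X;Y): X and Y are marginally independent here, so this is 0."""
    pxy = {}
    for (x, y, z), w in p.items():
        pxy[(x, y)] = pxy.get((x, y), 0.0) + w
    return sum(w * math.log(w / (0.5 * 0.5)) for w in pxy.values())

def mi_xy_given_z():
    """I(X;Y|Z): given Z = z, Y determines X, so this is H(X) = log 2."""
    # For every atom present: P(x,y|z) = 0.5, P(x|z) = 0.5, P(y|z) = 0.5.
    return sum(w * math.log(0.5 / (0.5 * 0.5)) for w in p.values())

assert abs(mi_xy()) < 1e-12                         # X independent of Y
assert abs(mi_xy_given_z() - math.log(2)) < 1e-12   # conditioning raised M.I.
```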
For example, for P_{Y|X} = BSC(δ) we have η_KL = (1 − 2δ)². Strong data-processing inequalities quantify the intuitive observation that noise inside the channel P_{Y|X} must reduce the information that Y carries about the data U, regardless of how smart the hookup U → X is. This is an active area of research; see [PW15] for a short summary.
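The BSC strong data-processing inequality with η_KL = (1 − 2δ)² can be spot-checked on a few input pairs. The particular δ and input distributions below are arbitrary choices of ours; this is a numerical check, not a proof:

```python
import math

def kl(p, q):
    """D(P||Q) in nats for strictly positive binary pmfs."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def bsc(p, delta):
    """Push a pmf [p0, p1] through BSC(delta)."""
    return [p[0] * (1 - delta) + p[1] * delta,
            p[0] * delta + p[1] * (1 - delta)]

delta = 0.2
eta = (1 - 2 * delta) ** 2   # contraction coefficient eta_KL of BSC(delta)
for p0, q0 in [(0.9, 0.1), (0.7, 0.5), (0.99, 0.4)]:
    p, q = [p0, 1 - p0], [q0, 1 - q0]
    # Strong DPI: D(P_Y||Q_Y) <= eta_KL * D(P_X||Q_X)
    assert kl(bsc(p, delta), bsc(q, delta)) <= eta * kl(p, q) + 1e-12
```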
2.6* How to avoid measurability problems?

As we mentioned in Remark 2.1, the conditions imposed by Definition 2.1 on P_{Y|X} are insufficient. Namely, we get the following two issues:

1. Radon–Nikodym derivatives such as (dP_{Y|X=x}/dQ_Y)(y) may not be jointly measurable in (x, y).
2. The set {x : P_{Y|X=x} ≪ Q_Y} may not be measurable.

The easiest way to avoid all such problems is the following:

Agreement A1: All conditional kernels P_{Y|X} : 𝒳 → 𝒴 in these notes will be assumed to be defined by choosing a σ-finite measure µ₂ on 𝒴 and a measurable function ρ(y|x) ≥ 0 on 𝒳 × 𝒴 such that

    P_{Y|X}(A|x) = ∫_A ρ(y|x) µ₂(dy)

for all x and measurable sets A, and ∫_𝒴 ρ(y|x) µ₂(dy) = 1 for all x.

Notes:

1. Given another kernel Q_{Y|X} specified via ρ′(y|x) and µ₂′, we may first replace µ₂ and µ₂′ by µ₂″ ≜ µ₂ + µ₂′ and thus assume that both P_{Y|X} and Q_{Y|X} are specified in terms of the same dominating measure µ₂. (This modifies ρ(y|x) to ρ(y|x) · dµ₂/dµ₂″(y).)

2. Given two kernels P_{Y|X} and Q_{Y|X} specified in terms of the same dominating measure µ₂ and functions ρ_P(y|x) and ρ_Q(y|x), respectively, we may set

    dP_{Y|X}/dQ_{Y|X} ≜ ρ_P(y|x)/ρ_Q(y|x)

outside of {ρ_Q = 0}. When P_{Y|X=x} ≪ Q_{Y|X=x}, this gives a version of the Radon–Nikodym derivative, which is automatically measurable in (x, y).

3. Given Q_Y specified as dQ_Y = q(y) dµ₂, we may set

    A₀ ≜ {x : ∫_{{q=0}} ρ(y|x) dµ₂ = 0}.

This set plays the role of {x : P_{Y|X=x} ≪ Q_Y}. Unlike the latter, A₀ is guaranteed to be measurable by Fubini [Ç11, Prop. 6.9]. By "plays the role" we mean that it allows one to prove statements like: for any P_X with P_Y ≪ Q_Y, we have P_X[A₀] = 1.

So, while our agreement resolves the two measurability problems above, it introduces a new one. Indeed, given a joint distribution P_{X,Y} on standard Borel spaces, it is always true that one can extract a conditional distribution P_{Y|X} satisfying Definition 2.1 (this is called disintegration). However, it is not guaranteed that P_{Y|X} will satisfy Agreement A1. To work around this issue as well, we add another agreement:
Agreement A2: All joint distributions P_{X,Y} are specified by means of the data: σ-finite measures µ₁, µ₂ on 𝒳 and 𝒴, respectively, and a measurable function λ(x, y) such that

    P_{X,Y}(E) ≜ ∫∫_E λ(x, y) µ₁(dx) µ₂(dy).

Notes:

1. Again, given a finite or countable collection of joint distributions P_{X,Y}, Q_{X,Y}, ... satisfying A2, we may without loss of generality assume they are all defined in terms of a common µ₁, µ₂.

2. Given P_{X,Y} satisfying A2, we can disintegrate it into a conditional (satisfying A1) and a marginal:

    P_{Y|X}(A|x) = ∫_A ρ(y|x) µ₂(dy),    ρ(y|x) ≜ λ(x, y)/p(x),      (2.9)
    P_X(A) = ∫_A p(x) µ₁(dx),            p(x) ≜ ∫_𝒴 λ(x, η) µ₂(dη),  (2.10)

with ρ(y|x) defined arbitrarily for those x for which p(x) = 0.

Remark 2.4. The first problem can also be resolved with the help of Doob's version of the Radon–Nikodym theorem [Ç11, Chapter V.4, Theorem 4.44]: if the σ-algebra on 𝒴 is separable (satisfied whenever 𝒴 is a Polish space, for example) and P_{Y|X=x} ≪ Q_{Y|X=x}, then there exists a jointly measurable version of the Radon–Nikodym derivative (x, y) ↦ (dP_{Y|X=x}/dQ_{Y|X=x})(y).
MIT OpenCourseWare
Information Theory, Spring 2016
For information about citing these materials or our Terms of Use, visit:
More informationLecture 10: Broadcast Channel and Superposition Coding
Lecture 10: Broadcast Channel and Superposition Coding Scribed by: Zhe Yao 1 Broadcast channel M 0M 1M P{y 1 y x} M M 01 1 M M 0 The capacity of the broadcast channel depends only on the marginal conditional
More informationREVIEW OF MAIN CONCEPTS AND FORMULAS A B = Ā B. Pr(A B C) = Pr(A) Pr(A B C) =Pr(A) Pr(B A) Pr(C A B)
REVIEW OF MAIN CONCEPTS AND FORMULAS Boolean algebra of events (subsets of a sample space) DeMorgan s formula: A B = Ā B A B = Ā B The notion of conditional probability, and of mutual independence of two
More informationLecture 22: Final Review
Lecture 22: Final Review Nuts and bolts Fundamental questions and limits Tools Practical algorithms Future topics Dr Yao Xie, ECE587, Information Theory, Duke University Basics Dr Yao Xie, ECE587, Information
More information3F1: Signals and Systems INFORMATION THEORY Examples Paper Solutions
Engineering Tripos Part IIA THIRD YEAR 3F: Signals and Systems INFORMATION THEORY Examples Paper Solutions. Let the joint probability mass function of two binary random variables X and Y be given in the
More informationMTH 404: Measure and Integration
MTH 404: Measure and Integration Semester 2, 2012-2013 Dr. Prahlad Vaidyanathan Contents I. Introduction....................................... 3 1. Motivation................................... 3 2. The
More informationLecture 3. Mathematical methods in communication I. REMINDER. A. Convex Set. A set R is a convex set iff, x 1,x 2 R, θ, 0 θ 1, θx 1 + θx 2 R, (1)
3- Mathematical methods in communication Lecture 3 Lecturer: Haim Permuter Scribe: Yuval Carmel, Dima Khaykin, Ziv Goldfeld I. REMINDER A. Convex Set A set R is a convex set iff, x,x 2 R, θ, θ, θx + θx
More informationMath General Topology Fall 2012 Homework 6 Solutions
Math 535 - General Topology Fall 202 Homework 6 Solutions Problem. Let F be the field R or C of real or complex numbers. Let n and denote by F[x, x 2,..., x n ] the set of all polynomials in n variables
More informationJoint Probability Distributions and Random Samples (Devore Chapter Five)
Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete
More informationIf Y and Y 0 satisfy (1-2), then Y = Y 0 a.s.
20 6. CONDITIONAL EXPECTATION Having discussed at length the limit theory for sums of independent random variables we will now move on to deal with dependent random variables. An important tool in this
More informationMachine Learning Lecture Notes
Machine Learning Lecture Notes Predrag Radivojac January 3, 25 Random Variables Until now we operated on relatively simple sample spaces and produced measure functions over sets of outcomes. In many situations,
More informationLECTURE 13. Last time: Lecture outline
LECTURE 13 Last time: Strong coding theorem Revisiting channel and codes Bound on probability of error Error exponent Lecture outline Fano s Lemma revisited Fano s inequality for codewords Converse to
More informationLecture 2. Capacity of the Gaussian channel
Spring, 207 5237S, Wireless Communications II 2. Lecture 2 Capacity of the Gaussian channel Review on basic concepts in inf. theory ( Cover&Thomas: Elements of Inf. Theory, Tse&Viswanath: Appendix B) AWGN
More informationLecture 11: Continuous-valued signals and differential entropy
Lecture 11: Continuous-valued signals and differential entropy Biology 429 Carl Bergstrom September 20, 2008 Sources: Parts of today s lecture follow Chapter 8 from Cover and Thomas (2007). Some components
More information9 Radon-Nikodym theorem and conditioning
Tel Aviv University, 2015 Functions of real variables 93 9 Radon-Nikodym theorem and conditioning 9a Borel-Kolmogorov paradox............. 93 9b Radon-Nikodym theorem.............. 94 9c Conditioning.....................
More informationP (A G) dp G P (A G)
First homework assignment. Due at 12:15 on 22 September 2016. Homework 1. We roll two dices. X is the result of one of them and Z the sum of the results. Find E [X Z. Homework 2. Let X be a r.v.. Assume
More informationHow to Quantitate a Markov Chain? Stochostic project 1
How to Quantitate a Markov Chain? Stochostic project 1 Chi-Ning,Chou Wei-chang,Lee PROFESSOR RAOUL NORMAND April 18, 2015 Abstract In this project, we want to quantitatively evaluate a Markov chain. In
More informationConcentration Inequalities
Chapter Concentration Inequalities I. Moment generating functions, the Chernoff method, and sub-gaussian and sub-exponential random variables a. Goal for this section: given a random variable X, how does
More informationFunctional Properties of MMSE
Functional Properties of MMSE Yihong Wu epartment of Electrical Engineering Princeton University Princeton, NJ 08544, USA Email: yihongwu@princeton.edu Sergio Verdú epartment of Electrical Engineering
More informationCapacity of AWGN channels
Chapter 3 Capacity of AWGN channels In this chapter we prove that the capacity of an AWGN channel with bandwidth W and signal-tonoise ratio SNR is W log 2 (1+SNR) bits per second (b/s). The proof that
More informationconditional cdf, conditional pdf, total probability theorem?
6 Multiple Random Variables 6.0 INTRODUCTION scalar vs. random variable cdf, pdf transformation of a random variable conditional cdf, conditional pdf, total probability theorem expectation of a random
More informationLecture 4 Noisy Channel Coding
Lecture 4 Noisy Channel Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 9, 2015 1 / 56 I-Hsiang Wang IT Lecture 4 The Channel Coding Problem
More informationU Logo Use Guidelines
Information Theory Lecture 3: Applications to Machine Learning U Logo Use Guidelines Mark Reid logo is a contemporary n of our heritage. presents our name, d and our motto: arn the nature of things. authenticity
More informationNoisy channel communication
Information Theory http://www.inf.ed.ac.uk/teaching/courses/it/ Week 6 Communication channels and Information Some notes on the noisy channel setup: Iain Murray, 2012 School of Informatics, University
More informationBrief Review of Probability
Brief Review of Probability Nuno Vasconcelos (Ken Kreutz-Delgado) ECE Department, UCSD Probability Probability theory is a mathematical language to deal with processes or experiments that are non-deterministic
More informationLecture 6 I. CHANNEL CODING. X n (m) P Y X
6- Introduction to Information Theory Lecture 6 Lecturer: Haim Permuter Scribe: Yoav Eisenberg and Yakov Miron I. CHANNEL CODING We consider the following channel coding problem: m = {,2,..,2 nr} Encoder
More informationIntroduction to Probability and Stocastic Processes - Part I
Introduction to Probability and Stocastic Processes - Part I Lecture 2 Henrik Vie Christensen vie@control.auc.dk Department of Control Engineering Institute of Electronic Systems Aalborg University Denmark
More information1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989),
Real Analysis 2, Math 651, Spring 2005 April 26, 2005 1 Real Analysis 2, Math 651, Spring 2005 Krzysztof Chris Ciesielski 1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer
More informationLecture 1: Introduction, Entropy and ML estimation
0-704: Information Processing and Learning Spring 202 Lecture : Introduction, Entropy and ML estimation Lecturer: Aarti Singh Scribes: Min Xu Disclaimer: These notes have not been subjected to the usual
More informationExample: Letter Frequencies
Example: Letter Frequencies i a i p i 1 a 0.0575 2 b 0.0128 3 c 0.0263 4 d 0.0285 5 e 0.0913 6 f 0.0173 7 g 0.0133 8 h 0.0313 9 i 0.0599 10 j 0.0006 11 k 0.0084 12 l 0.0335 13 m 0.0235 14 n 0.0596 15 o
More informationEE229B - Final Project. Capacity-Approaching Low-Density Parity-Check Codes
EE229B - Final Project Capacity-Approaching Low-Density Parity-Check Codes Pierre Garrigues EECS department, UC Berkeley garrigue@eecs.berkeley.edu May 13, 2005 Abstract The class of low-density parity-check
More informationPROOF OF ZADOR-GERSHO THEOREM
ZADOR-GERSHO THEOREM FOR VARIABLE-RATE VQ For a stationary source and large R, the least distortion of k-dim'l VQ with nth-order entropy coding and rate R or less is δ(k,n,r) m k * σ 2 η kn 2-2R = Z(k,n,R)
More informationExample: Letter Frequencies
Example: Letter Frequencies i a i p i 1 a 0.0575 2 b 0.0128 3 c 0.0263 4 d 0.0285 5 e 0.0913 6 f 0.0173 7 g 0.0133 8 h 0.0313 9 i 0.0599 10 j 0.0006 11 k 0.0084 12 l 0.0335 13 m 0.0235 14 n 0.0596 15 o
More informationlossless, optimal compressor
6. Variable-length Lossless Compression The principal engineering goal of compression is to represent a given sequence a, a 2,..., a n produced by a source as a sequence of bits of minimal possible length.
More informationErgodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R.
Ergodic Theorems Samy Tindel Purdue University Probability Theory 2 - MA 539 Taken from Probability: Theory and examples by R. Durrett Samy T. Ergodic theorems Probability Theory 1 / 92 Outline 1 Definitions
More informationEE514A Information Theory I Fall 2013
EE514A Information Theory I Fall 2013 K. Mohan, Prof. J. Bilmes University of Washington, Seattle Department of Electrical Engineering Fall Quarter, 2013 http://j.ee.washington.edu/~bilmes/classes/ee514a_fall_2013/
More informationSolutions to Set #2 Data Compression, Huffman code and AEP
Solutions to Set #2 Data Compression, Huffman code and AEP. Huffman coding. Consider the random variable ( ) x x X = 2 x 3 x 4 x 5 x 6 x 7 0.50 0.26 0. 0.04 0.04 0.03 0.02 (a) Find a binary Huffman code
More informationLECTURE 15. Last time: Feedback channel: setting up the problem. Lecture outline. Joint source and channel coding theorem
LECTURE 15 Last time: Feedback channel: setting up the problem Perfect feedback Feedback capacity Data compression Lecture outline Joint source and channel coding theorem Converse Robustness Brain teaser
More informationECE 587 / STA 563: Lecture 2 Measures of Information Information Theory Duke University, Fall 2017
ECE 587 / STA 563: Lecture 2 Measures of Information Information Theory Duke University, Fall 207 Author: Galen Reeves Last Modified: August 3, 207 Outline of lecture: 2. Quantifying Information..................................
More information