Information Theory and Communication

Ritwik Banerjee (rbanerjee@cs.stonybrook.edu)

General Chain Rules

Definition. The conditional mutual information of random variables X and Y, given the random variable Z, is the reduction in the uncertainty of X due to knowledge of Y, given Z. It is defined as

    I(X; Y | Z) = H(X | Z) - H(X | Y, Z) = E_{p(x,y,z)}[ log ( p(X, Y | Z) / ( p(X | Z) p(Y | Z) ) ) ]

Chain Rule for Information:

    I(X_1, X_2, ..., X_n; Y) = Σ_{i=1}^{n} I(X_i; Y | X_1, X_2, ..., X_{i-1})

Exercise. Prove the above theorem. Hint: I(X_1, X_2, ..., X_n; Y) = H(X_1, X_2, ..., X_n) - H(X_1, X_2, ..., X_n | Y). Then, write the joint entropy expressions as sums of conditional entropies (use the chain rule for entropy here).
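
As a quick sanity check of the chain rule above, the following sketch (not part of the lecture; the array shapes, variable names, and helper function H are my own choices) computes I(X1, X2; Y) and I(X1; Y) + I(X2; Y | X1) from a small randomly generated joint pmf and confirms that they agree.

import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 3, 2))          # joint pmf p(x1, x2, y) over a 2 x 3 x 2 alphabet
p /= p.sum()

def H(pmf):
    """Shannon entropy (in bits) of a pmf given as an array of probabilities."""
    q = pmf[pmf > 0]
    return -np.sum(q * np.log2(q))

# I(X1, X2; Y) = H(X1, X2) + H(Y) - H(X1, X2, Y)
lhs = H(p.sum(axis=2)) + H(p.sum(axis=(0, 1))) - H(p)

# I(X1; Y) = H(X1) + H(Y) - H(X1, Y)
i1 = H(p.sum(axis=(1, 2))) + H(p.sum(axis=(0, 1))) - H(p.sum(axis=1))

# I(X2; Y | X1) = H(X1, X2) + H(X1, Y) - H(X1, X2, Y) - H(X1)
i2 = H(p.sum(axis=2)) + H(p.sum(axis=1)) - H(p) - H(p.sum(axis=(1, 2)))

print(lhs, i1 + i2)                # the two values agree up to floating-point error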

General Chain Rules

Definition. For joint probability mass functions p(x, y) and q(x, y), the conditional relative entropy, denoted D(p(y|x) || q(y|x)), is the average of the relative entropies between the conditional probability mass functions p(y|x) and q(y|x), taken over the probability mass function p(x). That is,

    D(p(y|x) || q(y|x)) = Σ_{x ∈ X} p(x) Σ_{y ∈ Y} p(y|x) log [ p(y|x) / q(y|x) ] = E_{p(x,y)}[ log ( p(Y|X) / q(Y|X) ) ]

The KL divergence, i.e., relative entropy, between two joint distributions on a pair of random variables can be expressed as the sum of a relative entropy and a conditional relative entropy.
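
A small numerical sketch of this definition (my own illustration, not from the slides): it computes D(p(y|x) || q(y|x)) both as the weighted double sum over x and y and as an expectation under p(x, y), using two randomly generated joint pmfs with full support.

import numpy as np

rng = np.random.default_rng(1)
p = rng.random((3, 4)); p /= p.sum()     # joint pmf p(x, y), strictly positive
q = rng.random((3, 4)); q /= q.sum()     # joint pmf q(x, y), strictly positive

px = p.sum(axis=1, keepdims=True)        # marginal p(x)
p_y_given_x = p / px                     # conditional p(y|x)
q_y_given_x = q / q.sum(axis=1, keepdims=True)

# Definition: sum_x p(x) sum_y p(y|x) log[ p(y|x) / q(y|x) ]
d_sum = np.sum(px * p_y_given_x * np.log2(p_y_given_x / q_y_given_x))

# Equivalent form: E_{p(x,y)} log[ p(Y|X) / q(Y|X) ]
d_expect = np.sum(p * np.log2(p_y_given_x / q_y_given_x))

print(d_sum, d_expect)                   # identical up to floating-point error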

General Chain Rules

Chain Rule for Divergence:

    D(p(x, y) || q(x, y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x))

Exercise. Prove the above theorem. Hint: Write the joint distributions inside the logarithm as conditionals, and remember to sum out variables that do not occur outside the probability mass function.
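
The chain rule can also be verified numerically. The sketch below (illustrative only; the pmfs and the helper D are assumptions of mine) checks that D(p(x, y) || q(x, y)) equals D(p(x) || q(x)) + D(p(y|x) || q(y|x)) for two random joint pmfs.

import numpy as np

rng = np.random.default_rng(2)
p = rng.random((3, 4)); p /= p.sum()     # joint pmf p(x, y)
q = rng.random((3, 4)); q /= q.sum()     # joint pmf q(x, y)

def D(a, b):
    """Relative entropy D(a || b) in bits, assuming full support."""
    return np.sum(a * np.log2(a / b))

px, qx = p.sum(axis=1), q.sum(axis=1)    # marginals p(x), q(x)
p_y_x = p / px[:, None]                  # conditional p(y|x)
q_y_x = q / qx[:, None]                  # conditional q(y|x)

d_joint = D(p, q)
d_marginal = D(px, qx)
d_conditional = np.sum(p * np.log2(p_y_x / q_y_x))

print(d_joint, d_marginal + d_conditional)   # equal up to floating-point error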

Some important questions

How do we define inequality of information?
When is the entropy of a random variable maximised?
Are there any bounds on the entropy of a random variable?
Does conditioning always reduce entropy (i.e., is more information really always a good thing)?

Next, we will look at some properties of entropy (and other related definitions) and answer these questions.

Convex and Concave functions

Definition. A function f(x) is said to be convex over an interval (a, b) if for all x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1,

    f(λ x_1 + (1 - λ) x_2) ≤ λ f(x_1) + (1 - λ) f(x_2).

If equality holds only when λ = 0 or λ = 1, then the function is said to be strictly convex.

A function f(x) is said to be concave over an interval (a, b) if -f is convex over that interval.

If a function f has a second derivative that is non-negative (positive) over an interval, then it is convex (strictly convex) over that interval.
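
As a concrete illustration (my own, under the assumption that checking sampled points is enough to make the idea visible), the sketch below tests the defining inequality for f(x) = x log x on (0, ∞), whose second derivative 1/x is strictly positive, so the function is strictly convex there.

import numpy as np

def f(x):
    return x * np.log(x)                 # strictly convex on (0, inf); f''(x) = 1/x > 0

rng = np.random.default_rng(3)
x1, x2 = rng.uniform(0.1, 10.0, size=(2, 1000))
lam = rng.uniform(0.0, 1.0, size=1000)

lhs = f(lam * x1 + (1 - lam) * x2)       # f evaluated at the mixture point
rhs = lam * f(x1) + (1 - lam) * f(x2)    # mixture of the function values
print(np.all(lhs <= rhs + 1e-12))        # True: no sampled point violates convexity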

Jensen's Inequality

Jensen's Inequality: For any convex function f and random variable X,

    E[f(X)] ≥ f(E[X]).

Proof. By induction on the number of mass points for discrete distributions. Using arguments of limits and continuity, this proof can be extended to continuous distributions as well.

Consequences of Jensen's inequality:

The relation between per capita income and well-being is a concave function. Jensen's inequality then implies that the maximum average well-being of a society is attained when income is spread evenly (i.e., a uniform distribution).

KL divergence is always non-negative.
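
A minimal numerical sketch of Jensen's inequality (an illustration of mine, not from the slides): for the convex function f(x) = x^2 and a randomly chosen pmf on a small support, E[f(X)] is never smaller than f(E[X]).

import numpy as np

rng = np.random.default_rng(4)
x = np.arange(1, 6)                      # support of X: {1, 2, 3, 4, 5}
p = rng.random(5); p /= p.sum()          # a random pmf on that support

f = lambda t: t ** 2                     # a convex function
E_f_of_X = np.sum(p * f(x))              # E[f(X)]
f_of_E_X = f(np.sum(p * x))              # f(E[X])

print(E_f_of_X >= f_of_E_X)              # True, as Jensen's inequality requires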

Information Inequality

Let x ∈ X and let p(x), q(x) be two probability mass functions. Then the divergence of p from q is non-negative. That is,

    D(p(x) || q(x)) ≥ 0.

Hint: it is easier to show that the negative of the KL divergence is always non-positive.

Corollary 1: For any two random variables X and Y, I(X; Y) ≥ 0.
Corollary 2: Conditional relative entropy is non-negative.
Corollary 3: Conditional mutual information is non-negative: I(X; Y | Z) ≥ 0, with equality if and only if X and Y are conditionally independent given Z.
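
The information inequality is easy to check empirically. The sketch below (my own; the alphabet size and number of trials are arbitrary) verifies that D(p || q) is non-negative for random pmfs and that it vanishes when p = q.

import numpy as np

rng = np.random.default_rng(5)

def D(p, q):
    """Relative entropy D(p || q) in bits, assuming full support."""
    return np.sum(p * np.log2(p / q))

for _ in range(1000):
    p = rng.random(6); p /= p.sum()
    q = rng.random(6); q /= q.sum()
    assert D(p, q) >= 0                  # information inequality
    assert abs(D(p, p)) < 1e-12          # D(p || p) = 0

print("all checks passed")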