IT and large deviation theory


PhD short course Information Theory and Statistics Siena, 15-19 September, 2014 IT and large deviation theory Mauro Barni University of Siena

Outline of the short course Part 1: Information theory in a nutshell Part 2: The method of types and its relationship with statistics Part 3: Information theory and large deviation theory Part 4: Information theory and hypothesis testing Part 5: Application to adversarial signal processing

Outline of Part 3: large deviation theory, Sanov's theorem, the conditional limit theorem, examples.

Large deviation theory LDT studies the probability of rare events, i.e. events not covered by the law of large numbers. Examples: What is the probability that in 1000 tosses of a fair coin heads appears 800 times? Compute the probability that the mean value of a sequence (emitted by a DMS X) is larger than T, with T much larger than E[X]. Rare events also arise in statistical physics and economics.

Large deviation theory More formally: let S be a set of pmfs and let Q be a source. We want to compute the probability that Q emits a sequence whose type belongs to S: $Q(S) = \sum_{x:\, P_x \in S} Q(x)$. Example: what is the probability that the average value of a sequence drawn from Q is larger than 4? This is the above problem with $S = \{P : E_P[X] > 4\}$.

Large deviation theory If S contains a KL neighborhood of Q, then Q(S) → 1. If S does not contain Q or a KL neighborhood of Q, then Q(S) → 0. The question is: how fast? [Figure: the two cases, Q inside S and Q outside S.]

More formally: Sanov's theorem Theorem (Sanov). Let S be a regular set of pmfs, i.e. cl(int(S)) = S. Then $Q(S) \approx 2^{-n D(P^* \| Q)}$ (to first order in the exponent), where $P^* = \arg\min_{P \in S} D(P \| Q)$. [Figure: $P^*$ is the member of S closest to Q in KL divergence.]

Sanov's theorem Proof (upper bound).
$$Q(S) = \sum_{P \in S \cap \mathcal{P}_n} Q(T(P)) \le \sum_{P \in S \cap \mathcal{P}_n} 2^{-n D(P\|Q)} \le \sum_{P \in S \cap \mathcal{P}_n} 2^{-n \min_{P \in S \cap \mathcal{P}_n} D(P\|Q)} \le \sum_{P \in S \cap \mathcal{P}_n} 2^{-n \min_{P \in S} D(P\|Q)} = \sum_{P \in S \cap \mathcal{P}_n} 2^{-n D(P^*\|Q)} \le (n+1)^{|\mathcal{X}|}\, 2^{-n D(P^*\|Q)}$$

Sanov's theorem Proof (lower bound). Due to the regularity of S and the density of $\bigcup_n \mathcal{P}_n$ in the set of all pmfs, we can find a sequence $P_n \in S \cap \mathcal{P}_n$ such that $P_n \to P^*$ and hence $D(P_n\|Q) \to D(P^*\|Q)$. Then for large n we can write:
$$Q(S) = \sum_{P \in S \cap \mathcal{P}_n} Q(T(P)) \ge Q(T(P_n)) \ge \frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-n D(P_n\|Q)} \approx \frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-n D(P^*\|Q)}$$

Example Compute the probability that in 1000 tosses of a fair coin, heads shows at least 800 times. $S = \{B(p, 1-p) : p \ge 0.8\}$, $Q = B(0.5, 0.5)$, $P^* = B(0.8, 0.2)$. $D(P^*\|Q) = 1 - H(P^*) = 1 - h(0.8) \approx 0.28$, so $Q(S) \approx 2^{-n D(P^*\|Q)} \approx 2^{-280}$!
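To make the numbers concrete, here is a minimal sketch (in Python; the slides themselves contain no code) that evaluates the exponent $D(P^*\|Q)$ and compares the Sanov estimate with the exact binomial tail probability, computed in the log domain:

```python
# Sanov estimate vs. exact tail probability for the 800-heads example
# (a minimal Python sketch; the slide itself contains no code).
import math

def kl_binary(p, q):
    """D(B(p) || B(q)) in bits."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

n, p_star, q = 1000, 0.8, 0.5
D = kl_binary(p_star, q)               # 1 - h(0.8), about 0.278 bits

# Exact Pr(#heads >= 800) for a fair coin, computed in the log domain:
# every sequence has probability 0.5^n, so each term is C(n, k) * 0.5^n.
log_terms = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + n * math.log(0.5) for k in range(800, n + 1)]
m = max(log_terms)
log2_exact = (m + math.log(sum(math.exp(t - m) for t in log_terms))) / math.log(2)

print("D(P*||Q)       = %.4f bits" % D)
print("Sanov estimate : 2^(-%.0f)" % (n * D))
print("Exact tail     : 2^(%.0f)" % log2_exact)
```

The two exponents differ only by a factor that is polynomial in n, as the theorem promises: the estimate is correct to first order in the exponent.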

A more general example We may want to compute $\Pr\left[ \frac{1}{n}\sum_{i=1}^{n} g_j(X_i) \ge \alpha_j,\; j = 1,\dots,k \right]$. Apply Sanov's theorem with $S = \left\{ P : \sum_{x \in \mathcal{X}} P(x) g_j(x) \ge \alpha_j,\; j = 1,\dots,k \right\}$. We can use Lagrange multipliers to minimize $D(P\|Q)$ subject to $P \in S$.

A more general example Unconstrained minimization of
$$L(P) = \sum_{x} P(x)\log\frac{P(x)}{Q(x)} + \sum_{j=1}^{k} \lambda_j\left( \sum_{x} P(x) g_j(x) - \alpha_j \right) + \beta\left( \sum_{x} P(x) - 1 \right)$$
yielding (after some algebra, sketched below):
$$P^*(x) = \frac{1}{K}\, Q(x)\, e^{\sum_j \lambda_j g_j(x)} \quad\text{with}\quad K = \sum_{x \in \mathcal{X}} Q(x)\, e^{\sum_j \lambda_j g_j(x)}$$
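The "some algebra" is just the stationarity condition of $L(P)$; a possible reconstruction (not spelled out on the slide), differentiating with respect to each $P(x)$:
$$\frac{\partial L(P)}{\partial P(x)} = \log\frac{P(x)}{Q(x)} + 1 + \sum_{j=1}^{k} \lambda_j g_j(x) + \beta = 0 \;\Longrightarrow\; P(x) \propto Q(x)\, e^{-\sum_j \lambda_j g_j(x)}.$$
Since the $\lambda_j$ are fixed only by the constraints, relabeling $\lambda_j \to -\lambda_j$ and absorbing $\beta$ and the constant into the normalizer $K$ gives the stated $P^*(x)$.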

A numerical example Compute the probability that the average of n tosses of a fair die is larger than 4 (instead of 3.5). $S = \left\{ P : \sum_{x=1}^{6} x P(x) \ge 4 \right\}$. From the previous result we have $P^*(x) = \frac{2^{\lambda x}}{\sum_{i=1}^{6} 2^{\lambda i}}$ with $\lambda$ chosen in such a way that $\sum_{x=1}^{6} x P^*(x) = 4$, which can be solved numerically (e.g. in Matlab; a Python sketch follows).
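A sketch of that numerical step (the slide mentions Matlab; this is an equivalent Python version, where the use of bisection and the bracketing interval are my choices, not the slide's):

```python
# Solving for the tilt parameter lambda in the fair-die example (a Python
# sketch of the step the slide delegates to Matlab; bisection is my choice).
import math

faces = list(range(1, 7))
target_mean = 4.0

def tilted_mean(lam):
    """Mean of P*(x) = 2^(lam*x) / sum_i 2^(lam*i) for a fair die."""
    w = [2 ** (lam * x) for x in faces]
    return sum(x * wx for x, wx in zip(faces, w)) / sum(w)

# tilted_mean is increasing in lam; lam = 0 gives 3.5 and lam = 1 gives ~5.1,
# so the root lies in [0, 1] and plain bisection suffices.
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if tilted_mean(mid) < target_mean:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)

w = [2 ** (lam * x) for x in faces]
p_star = [wx / sum(w) for wx in w]
D = sum(p * math.log2(6 * p) for p in p_star)  # D(P* || uniform) in bits

print("lambda   =", round(lam, 4))
print("P*       =", [round(p, 4) for p in p_star])
print("D(P*||Q) =", round(D, 4), "bits  ->  Q(S) ~ 2^(-%.1f) for n = 100" % (100 * D))
```

Running it gives $\lambda \approx 0.25$ and $D(P^*\|Q) \approx 0.06$ bits, with $P^*$ tilted toward the larger faces.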

Homework: how lucky do you need to be? Is it better to bet that head will show up in 3/5 of the tosses of a fair coin or that face 6 will show in 5/18 of the tosses of a fair die?

Conditional limit theorem Not only is the probability of S determined by $P^*$, but $P^*$ also determines the conditional distribution of the symbols of $x^n$ given that its type lies in S. Theorem. Let S be a closed convex set of pmfs. Let $X_i$ be a sequence of iid RVs generated by Q. Let $P^*$ be defined as in Sanov's theorem. Then $\Pr_Q\{ X_1 = a \mid P_{x^n} \in S \} \to P^*(a)$ for all $a \in \mathcal{X}$.

Conditional limit theorem (extension) Theorem. Let S be a closed convex set of pmfs. Let $X_i$ be a sequence of iid RVs generated by Q. Let $P^*$ be defined as in Sanov's theorem. Let m be fixed. Then $\Pr_Q\{ X_1 = a_1, X_2 = a_2, \dots, X_m = a_m \mid P_{x^n} \in S \} \to \prod_{i=1}^{m} P^*(a_i)$. Remark The theorem holds for any fixed m, but not for m = n.
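A quick Monte Carlo check makes the statement tangible (a sketch; n = 25 and the number of trials are arbitrary illustrative choices, not from the slides): draw die sequences from the uniform Q, keep those whose empirical mean is at least 4, and tabulate the conditional distribution of $X_1$, which should be close to the tilted pmf $P^*$ found in the previous sketch.

```python
# Monte Carlo illustration of the conditional limit theorem (a sketch;
# n = 25 and 100,000 trials are arbitrary choices, not from the slides).
import random
from collections import Counter

random.seed(0)
n, trials = 25, 100_000
first_symbol = Counter()
accepted = 0

for _ in range(trials):
    seq = [random.randint(1, 6) for _ in range(n)]
    if sum(seq) / n >= 4:          # condition on the event "empirical mean >= 4"
        first_symbol[seq[0]] += 1  # record only X_1 (the m = 1 case of the theorem)
        accepted += 1

print("Estimated Q(S) =", accepted / trials)
print("Pr(X1 = a | mean >= 4):",
      {a: round(first_symbol[a] / accepted, 3) for a in range(1, 7)})
# For large n this conditional pmf approaches the tilted P* of the previous sketch.
```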

Homework: a lucky friend You are told that your friend was so lucky that, in a whole night spent tossing dice, face 6 showed up ¼ of the times. Estimate the probability that face 1 never showed in the first 10 tosses. Do the same for the first 100 tosses (assuming that over the whole night your friend tossed the die much more than 100 times).

References 1. T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley. 2. I. Csiszár, "The method of types," IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2505-2523, Oct. 1998. 3. I. Csiszár and P. C. Shields, Information Theory and Statistics: A Tutorial, Foundations and Trends in Communications and Information Theory, NOW Publishers, 2004.