Information Theory and Hypothesis Testing


Summer School on Game Theory and Telecommunications, Campione, 7-12 September 2014
Information Theory and Hypothesis Testing
Mauro Barni, University of Siena

September 8: Review of some basic results linking Information Theory and Statistics (hypothesis testing).
September 9: Application of game theory and IT to adversarial signal processing: the Hypothesis Testing game.

Outline
Review of basic concepts: hypothesis testing, Information Theory
The method of types in Information Theory
Some classical applications: the law of large numbers, Sanov's theorem and large deviation theory, hypothesis testing

Hypothesis testing

Hypothesis testing: problem definition
Under H_0 the observed sequence x^n is drawn from P_0; under H_1 it is drawn from P_1. The test observes x^n and outputs 0 (H_0) or 1 (H_1).
Goal: decide in favor of H_0 or H_1 according to a predefined optimality criterion.
Decision rule: φ(x^n) ∈ {0, 1}
Acceptance region: Λ_0 = {x^n : φ(x^n) = 0}

Hypothesis testing: problem definition
Type I error probability (false alarm, false positive): P_{1/0} = P_f = Pr{φ(X^n) = 1 | H_0} = P_0(X^n ∈ Λ_1)
Type II error probability (missed detection, false negative): P_{0/1} = P_m = Pr{φ(X^n) = 0 | H_1} = P_1(X^n ∈ Λ_0)
A priori probabilities: P_{H_0}, P_{H_1}
Costs (losses, risk): L_{1/0}, L_{0/1}

Bayes criterion
Minimize the Bayes risk: find φ(x^n) such that L = P_{H_0} P_{1/0} L_{1/0} + P_{H_1} P_{0/1} L_{0/1} is minimum.
Optimum acceptance region (likelihood ratio test):
Λ_0 = {x^n : P_0(x^n)/P_1(x^n) ≥ (P_{H_1} L_{0/1}) / (P_{H_0} L_{1/0})}
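As a concrete illustration (not part of the slides), here is a minimal sketch of the Bayes likelihood-ratio test for a discrete i.i.d. observation model; the pmfs, priors and costs are made-up example values.

```python
import numpy as np

def bayes_test(x, P0, P1, prior0, prior1, L10, L01):
    """Return 0 (accept H0) or 1 (accept H1) for an i.i.d. sample x of symbol indices.

    Accept H0 when P0(x^n)/P1(x^n) >= (prior1 * L01) / (prior0 * L10);
    the comparison is done in the log domain for numerical stability.
    """
    llr = np.sum(np.log(P0[x]) - np.log(P1[x]))              # log P0(x^n)/P1(x^n)
    threshold = np.log((prior1 * L01) / (prior0 * L10))
    return 0 if llr >= threshold else 1

# Toy binary source: hypothetical pmfs, equal priors and unit costs
P0 = np.array([0.7, 0.3])                 # P0(x=0), P0(x=1)
P1 = np.array([0.4, 0.6])
x = np.array([0, 0, 1, 0, 1, 0, 0, 0])    # observed sequence
print(bayes_test(x, P0, P1, prior0=0.5, prior1=0.5, L10=1.0, L01=1.0))   # -> 0
```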

Neyman-Pearson criterion
Problems with the Bayes approach: a priori probabilities are needed, and it is difficult to define costs, especially for rare events.
Neyman-Pearson approach: minimize P_m subject to a constraint on P_f.
Λ_0 = {x^n : P_0(x^n)/P_1(x^n) ≥ T}   (likelihood ratio test)
T is determined by letting P_f = λ.

An example: Gaussian observables (1/2)
P_0 = N(μ_0, σ²), P_1 = N(μ_1, σ²), with μ_1 > μ_0 (density (1/√(2πσ²)) e^{−(x−μ_0)²/2σ²} under H_0).
L(x^n) = e^{−Σ_i (x_i−μ_0)²/2σ²} / e^{−Σ_i (x_i−μ_1)²/2σ²}
Taking the log, the test compares Σ_i (x_i−μ_1)² − Σ_i (x_i−μ_0)² with a threshold; after some algebra it reduces to
X̄ = (1/n) Σ_i x_i ≷ τ
P(X̄ | H_0) = N(μ_0, σ²/n),  P(X̄ | H_1) = N(μ_1, σ²/n)
Determine τ so that the tail of N(μ_0, σ²/n) above τ equals λ (i.e. P_f = λ).
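A small numerical sketch (mine, not from the slides) of this Gaussian example: it sets the threshold τ on the sample mean so that P_f = λ and then computes the resulting P_m, using only the standard Gaussian tail function; μ_0, μ_1, σ, n and λ are example values.

```python
import math

def q_func(z):
    """Gaussian tail probability Q(z) = Pr{N(0,1) > z}."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def np_test_gaussian(mu0, mu1, sigma, n, lam):
    """H0: N(mu0, sigma^2) vs H1: N(mu1, sigma^2), mu1 > mu0, i.i.d. sample of size n.

    Under H0 the sample mean is N(mu0, sigma^2/n); choose tau so that
    Pf = Pr{mean > tau | H0} = lam, then return (tau, Pf, Pm).
    """
    s = sigma / math.sqrt(n)
    lo, hi = -10.0, 10.0                  # invert Q by bisection (keeps the sketch dependency-free)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if q_func(mid) > lam else (lo, mid)
    tau = mu0 + s * 0.5 * (lo + hi)
    pf = q_func((tau - mu0) / s)
    pm = 1.0 - q_func((tau - mu1) / s)    # Pr{mean <= tau | H1}
    return tau, pf, pm

print(np_test_gaussian(mu0=0.0, mu1=1.0, sigma=1.0, n=25, lam=1e-3))
```

Sweeping λ over a range of values and recording the resulting (P_f, P_m) pairs produces the ROC curve shown two slides below.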

An example: Gaussian observables (2/2)
[Figure: the densities N(μ_0, σ²/n) and N(μ_1, σ²/n) with the threshold τ; the right tail of the H_0 density gives P_f, the left tail of the H_1 density gives P_m.]

ROC curve
[Figure: ROC curve for the Gaussian example, P_m versus P_f on logarithmic axes from 10^-13 to 1.]

Information Theory

Measuring information: Shannon's approach
Model a source of information as a random variable.
Information is related to ignorance and unpredictability: it depends only on the probability of the events, not on their values.
No attempt to model the importance of information in a given context, and no link with the perceived level of information; the other important aspects are retained.

Axiomatic definition of entropy
Given a discrete source (random variable) X with alphabet X, look for a measure of information with the following properties:
H_2(p, 1−p) is continuous in p
H_2(1/2, 1/2) = 1 (normalization: the bit)
Grouping property: H_m(p_1, p_2, ..., p_m) = H_{m−1}(p_1+p_2, p_3, ..., p_m) + (p_1+p_2) H_2(p_1/(p_1+p_2), p_2/(p_1+p_2))
Independence from permutations: H_m(p_1, ..., p_m) = H_m(σ(p_1, ..., p_m))
With these axioms we necessarily have
Source entropy = H(X) = −Σ_{i=1}^{|X|} p_i log_2 p_i = −Σ_{x∈X} p(x) log_2 p(x)
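A minimal sketch (my own addition) of the entropy formula; it also checks the grouping axiom numerically on a small example pmf.

```python
import math

def H(p):
    """Shannon entropy in bits of a pmf given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.25]
lhs = H(p)
# Grouping property: H3(p1,p2,p3) = H2(p1+p2, p3) + (p1+p2) * H2(p1/(p1+p2), p2/(p1+p2))
p12 = p[0] + p[1]
rhs = H([p12, p[2]]) + p12 * H([p[0] / p12, p[1] / p12])
print(lhs, rhs)   # both equal 1.5 bits
```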

Source coding
The source coding theorem clarifies the meaning of the entropy.
Source Coding Theorem (Shannon 1948): given a discrete memoryless source X, a lossless coding scheme exists if and only if the rate of the code is larger than the entropy, R > H.
The entropy captures the essence of information in the sense that it gives the minimum number of bits necessary to describe the output of a source.

Other quantities
We can define several other quantities capturing different aspects of the information measure when two or more sources are involved:
H(X,Y) = −Σ_x Σ_y p(x,y) log p(x,y)   (joint entropy)
H(X|Y) = −Σ_x Σ_y p(x,y) log p(x|y)   (conditional entropy)
I(X;Y) = Σ_x Σ_y p(x,y) log [p(x,y) / (p(x)p(y))]   (mutual information)

Relative entropy
In the rest of the talk we will make extensive use of the following quantity (also called divergence, or Kullback-Leibler distance):
D(P||Q) = Σ_{x∈X} P(x) log [P(x)/Q(x)]
The divergence can be interpreted as a kind of distance between pmfs:
D(P||Q) ≥ 0, with D(P||Q) = 0 if and only if P = Q.
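A small sketch (not in the slides) of the divergence formula; note that it is not symmetric, so it is a "distance" only in a loose sense.

```python
import math

def D(P, Q):
    """Kullback-Leibler divergence D(P||Q) in bits; assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

P, Q = [0.8, 0.2], [0.5, 0.5]
print(D(P, Q))   # about 0.28 bits
print(D(Q, P))   # about 0.32 bits: the divergence is not symmetric
print(D(P, P))   # exactly 0, as expected when the two pmfs coincide
```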

The method of types

Type or empirical probability
Type, or empirical probability, of a sequence x^n:
P_{x^n}(a) = N(a|x^n)/n,  a ∈ X
P_n = set of all types with denominator n.
Example: if X = {0,1},
P_5 = {(0,1), (1/5, 4/5), (2/5, 3/5), (3/5, 2/5), (4/5, 1/5), (1,0)}

Type class
Type class: all the sequences having the same type.
T(P) = {x^n ∈ X^n : P_{x^n} = P}
Example: x^5 = 01100, P_{x^5} = (3/5, 2/5),
T(P_{x^5}) = {11000, 10100, 10010, 10001, 01100, 01010, 01001, 00110, 00101, 00011}
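A short sketch (mine) that computes the type of a binary sequence and enumerates its type class, reproducing the x^5 = 01100 example above.

```python
from collections import Counter
from itertools import permutations

def type_of(x, alphabet):
    """Empirical pmf P_x(a) = N(a|x)/n over the given alphabet."""
    counts, n = Counter(x), len(x)
    return tuple(counts[a] / n for a in alphabet)

def type_class(x):
    """All distinct sequences of the same length sharing the type of x."""
    return sorted("".join(p) for p in set(permutations(x)))

x = "01100"
print(type_of(x, "01"))     # (0.6, 0.4), i.e. (3/5, 2/5)
print(type_class(x))        # the 10 sequences listed above
print(len(type_class(x)))   # |T(P)| = 10
```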

Number of types
The number of types grows polynomially with n.
Theorem: the number of types with denominator n is upper bounded by |P_n| ≤ (n+1)^{|X|}.
Proof: obvious (each of the |X| entries of a type can only take the values 0/n, 1/n, ..., n/n).

Probability of a sequence
Theorem: the probability that a sequence x = x^n is emitted by a DMS with pmf Q is
Q(x) = 2^{−n(H(P_x) + D(P_x||Q))}
If P_x = Q, then Q(x) = 2^{−nH(P_x)} = 2^{−nH(Q)}.
Remember: the larger the KL distance between the type of x and Q, the lower the probability.

Probability of a sequence
Proof.
Q(x) = Π_i Q(x_i) = Π_{a∈X} Q(a)^{N(a|x)} = Π_{a∈X} Q(a)^{nP_x(a)} = 2^{n Σ_{a∈X} P_x(a) log Q(a)}
= 2^{n Σ_a [P_x(a) log Q(a) − P_x(a) log P_x(a) + P_x(a) log P_x(a)]}
= 2^{−n Σ_a [P_x(a) log (P_x(a)/Q(a)) − P_x(a) log P_x(a)]}
= 2^{−n[H(P_x) + D(P_x||Q)]}
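A quick numerical check (my own) of the theorem on an example sequence and source: the direct product of symbol probabilities coincides with 2^{-n(H(P_x)+D(P_x||Q))}.

```python
import math
from collections import Counter

def entropy(P):
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

def kl(P, Q):
    return sum(p * math.log2(p / Q[a]) for a, p in P.items() if p > 0)

x = "00101001"                        # example sequence
Q = {"0": 0.7, "1": 0.3}              # example DMS pmf
n = len(x)
Px = {a: c / n for a, c in Counter(x).items()}

direct = math.prod(Q[a] for a in x)                     # Q(x) computed symbol by symbol
via_types = 2 ** (-n * (entropy(Px) + kl(Px, Q)))       # the method-of-types expression
print(direct, via_types)              # the two values coincide (about 4.5e-3)
```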

Size of a type class
Theorem: the size of a type class T(P) can be bounded as follows:
(1/(n+1)^{|X|}) 2^{nH(P)} ≤ |T(P)| ≤ 2^{nH(P)}
Remember: the size of a type class grows exponentially, with growth rate equal to the entropy of the type.

Size of a type class
Proof (upper bound). Given P ∈ P_n, consider the probability that a source with pmf P emits a sequence in T(P). We have
1 ≥ Σ_{x∈T(P)} P(x) = Σ_{x∈T(P)} 2^{−nH(P)} = |T(P)| 2^{−nH(P)},
hence |T(P)| ≤ 2^{nH(P)}.

Size of a type class
Proof (lower bound).
|T(P)| = (n choose nP(a_1), ..., nP(a_|X|)) = n! / (n_1! n_2! ... n_{|X|}!)
Using the Stirling approximation n! ≈ (n/e)^n for the numerator and for the factorials in the denominator, after some algebra one obtains
|T(P)| ≥ (1/(n+1)^{|X|}) 2^{nH(P)}

Probability of a type class
Theorem: the probability that a DMS with pmf Q emits a sequence belonging to T(P) can be bounded as follows:
(1/(n+1)^{|X|}) 2^{−nD(P||Q)} ≤ Q(T(P)) ≤ 2^{−nD(P||Q)}
Remember: the larger the KL distance between P and Q, the smaller the probability. As n grows, essentially all the probability concentrates on the type classes of types close to Q.

Probability of a type class
Proof.
Q(T(P)) = Σ_{x∈T(P)} Q(x) = Σ_{x∈T(P)} 2^{−n(H(P)+D(P||Q))} = |T(P)| 2^{−n(H(P)+D(P||Q))}
By recalling the bounds on the size of T(P):
(1/(n+1)^{|X|}) 2^{−nD(P||Q)} ≤ Q(T(P)) ≤ 2^{−nD(P||Q)}

In summary
|P_n| ≤ (n+1)^{|X|}
Q(x) = 2^{−n[D(P_x||Q) + H(P_x)]}
|T(P)| ≈ 2^{nH(P)}   (up to a polynomial factor)
Q(T(P)) ≈ 2^{−nD(P||Q)}   (up to a polynomial factor)
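A brute-force sketch (not from the slides) that enumerates all binary sequences of a short length and verifies the bounds on |T(P)| and Q(T(P)) summarized above; n and Q are example values.

```python
import math
from collections import Counter
from itertools import product

def entropy(P):
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

def kl(P, Q):
    return sum(p * math.log2(p / Q[a]) for a, p in P.items() if p > 0)

n, X = 8, "01"
Q = {"0": 0.7, "1": 0.3}
tol = 1e-9                                              # guard against float rounding at corner types

by_type = {}                                            # group all length-n sequences by their type
for seq in product(X, repeat=n):
    P = tuple(Counter(seq)[a] / n for a in X)
    by_type.setdefault(P, []).append(seq)

for P, seqs in by_type.items():
    Pd = dict(zip(X, P))
    H, D = entropy(Pd), kl(Pd, Q)
    prob = sum(math.prod(Q[a] for a in s) for s in seqs)          # Q(T(P))
    assert 2 ** (n * H) / (n + 1) ** len(X) <= len(seqs) <= 2 ** (n * H) * (1 + tol)
    assert 2 ** (-n * D) / (n + 1) ** len(X) <= prob <= 2 ** (-n * D) * (1 + tol)
print("all type-class bounds verified for n =", n)
```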

Information Theory and Statistics

Law of large numbers
Q(T(P)) ≈ 2^{−nD(P||Q)}: when n grows, the only type classes with non-negligible probability are those of types close to Q.
Theorem (law of large numbers). Let T_Q^ε = {x^n : D(P_{x^n}||Q) ≤ ε}. Then
P(x^n ∉ T_Q^ε) = Σ_{P: D(P||Q)>ε} Q(T(P)) ≤ Σ_{P: D(P||Q)>ε} 2^{−nD(P||Q)} ≤ (n+1)^{|X|} 2^{−nε} = 2^{−n(ε − |X| log(n+1)/n)},
which tends to 0 when n tends to infinity.
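A Monte Carlo sketch (my own) of the statement: the estimated probability that D(P_{x^n}||Q) exceeds ε shrinks as n grows; Q, ε and the trial counts are arbitrary example values.

```python
import math
import random
from collections import Counter

def kl(P, Q):
    return sum(p * math.log2(p / Q[a]) for a, p in P.items() if p > 0)

random.seed(0)
Q = {"0": 0.7, "1": 0.3}
eps, trials = 0.05, 5000

for n in (10, 50, 200, 1000):
    bad = 0
    for _ in range(trials):
        x = random.choices(list(Q), weights=list(Q.values()), k=n)
        Px = {a: c / n for a, c in Counter(x).items()}
        bad += kl(Px, Q) > eps
    print(n, bad / trials)    # fraction of sequences outside T_Q^eps; decays toward 0
```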

Large deviation theory
LDT studies the probability of rare events, i.e. events not covered by the law of large numbers.
Examples: What is the probability that in 1000 fair coin tosses head appears 800 times? Compute the probability that the mean value of a sequence (emitted by a DMS X) is larger than T, with T much larger than E[X]. Rare events in statistical physics or economics.

Large deviation theory
More formally: let S be a subset of pmfs and let Q be a source. We want to compute the probability that Q emits a sequence whose type belongs to S:
Q(S) = Σ_{x: P_x ∈ S} Q(x)
Example: what is the probability that the average value of a sequence drawn from Q is larger than 4? This is the above problem with S = {pmfs P such that E_P[X] > 4}.

Large deviation theory
If S contains a KL neighborhood of Q, then Q(S) → 1.
If S does not contain Q or a KL neighborhood of Q, then Q(S) → 0. The question is: how fast?
[Figure: the set S drawn once with Q inside it and once with Q outside it.]

Sanov's theorem
Theorem (Sanov). Let S be a regular set of pmfs (cl(int(S)) = S); then
Q(S) ≈ 2^{−nD(P*||Q)},  where P* = argmin_{P∈S} D(P||Q)
[Figure: the set S with P*, the point of S closest to Q in KL distance.]

Sanov's theorem
Proof (upper bound).
Q(S) = Σ_{P∈S∩P_n} Q(T(P)) ≤ Σ_{P∈S∩P_n} 2^{−nD(P||Q)} ≤ Σ_{P∈S∩P_n} 2^{−n min_{P∈S∩P_n} D(P||Q)}
≤ Σ_{P∈S∩P_n} 2^{−n min_{P∈S} D(P||Q)} = Σ_{P∈S∩P_n} 2^{−nD(P*||Q)} ≤ (n+1)^{|X|} 2^{−nD(P*||Q)}

Sanov's theorem
Proof (lower bound). Due to the regularity of S and the density of ∪_n P_n in the set of all pmfs, we can find a sequence P_n ∈ S ∩ P_n such that P_n → P* and hence D(P_n||Q) → D(P*||Q). Then for large n we can write:
Q(S) = Σ_{P∈S∩P_n} Q(T(P)) ≥ Q(T(P_n)) ≥ (1/(n+1)^{|X|}) 2^{−nD(P_n||Q)} ≈ (1/(n+1)^{|X|}) 2^{−nD(P*||Q)}

Example
Compute the probability that in 1000 fair coin tosses, head shows more than 800 times.
S = {B(p, 1−p) with p ≥ 0.8}, Q = B(0.5, 0.5), P* = B(0.8, 0.2)
D(P*||Q) = 1 − H(P*) = 1 − h(0.8) ≈ 0.28
P(S) ≈ 2^{−nD(P*||Q)} ≈ 2^{−280} !!!!
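A numerical check (mine) of this example: the Sanov exponent n·D(P*||Q) = n(1 − h(0.8)) against the exact binomial tail probability, both expressed in bits (base-2 logarithms).

```python
import math

def h(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n, k = 1000, 800
sanov_exponent = n * (1 - h(0.8))       # n * D(P*||Q), Q = Bernoulli(1/2), P* = Bernoulli(0.8)

def log2_binom(n, k):
    """log2 of the binomial coefficient C(n, k), via log-gamma."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(2)

# Exact Pr{at least 800 heads in 1000 fair tosses} = sum_{j>=800} C(1000, j) * 2^-1000
tail = sum(2 ** (log2_binom(n, j) - n) for j in range(k, n + 1))
print(sanov_exponent)        # about 278, i.e. P(S) ~ 2^-278 to first order in the exponent
print(-math.log2(tail))      # about 283: same exponential behaviour, up to sub-exponential terms
```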

Hypothesis testing
Consider two hypotheses H_0 and H_1 and a sequence of observations x^n:
H_0: X ~ P_0;  H_1: X ~ P_1
Neyman-Pearson criterion: minimize P_{0/1} for a fixed P_{1/0}.
Decide for H_0 if P_0(x^n)/P_1(x^n) ≥ T (likelihood ratio test; T depends on the constraint on P_{1/0}).

Hypothesis testing
Let us pass to the log-likelihood ratio and assume a DMS:
log [P_0(x^n)/P_1(x^n)] = Σ_i log [P_0(x_i)/P_1(x_i)] = Σ_{a∈X} n P_x(a) log [P_0(a)/P_1(a)]
= Σ_{a∈X} n P_x(a) log [P_x(a)/P_1(a)] − Σ_{a∈X} n P_x(a) log [P_x(a)/P_0(a)]
= n [D(P_x||P_1) − D(P_x||P_0)]
The N-P criterion boils down to:
D(P_x||P_1) − D(P_x||P_0) ≷ (log T)/n = τ
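A small sketch (not from the slides) that verifies the identity above on a toy binary example: the log-likelihood ratio equals n[D(P_x||P_1) − D(P_x||P_0)].

```python
import math
from collections import Counter

def kl(P, Q):
    return sum(p * math.log2(p / Q[a]) for a, p in P.items() if p > 0)

P0 = {"0": 0.7, "1": 0.3}
P1 = {"0": 0.4, "1": 0.6}
x = "0010001011"
n = len(x)
Px = {a: c / n for a, c in Counter(x).items()}

llr = sum(math.log2(P0[a] / P1[a]) for a in x)          # log [P0(x^n) / P1(x^n)]
via_divergences = n * (kl(Px, P1) - kl(Px, P0))         # the method-of-types form
print(llr, via_divergences)                             # identical up to rounding
```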

Error exponents in HT and Sanov's theorem
λ = lim_{n→∞} −(1/n) log P_{1/0},  ε = lim_{n→∞} −(1/n) log P_{0/1}
In the space of types, the acceptance region is Λ_0 = {P : D(P||P_0) − D(P||P_1) < τ}.
By Sanov's theorem:
λ = D(P_0*||P_0), with P_0* = argmin_{P∉Λ_0} D(P||P_0)
ε = D(P_1*||P_1), with P_1* = argmin_{P∈Λ_0} D(P||P_1)
It can be proven that P_0* = P_1* = P*, so λ = D(P*||P_0) and ε = D(P*||P_1).
[Figure: P_0 and P_1 in the space of pmfs, the boundary of Λ_0 and the point P* lying on it.]

Best achievable error exponent: Stein's lemma
If it is enough that P_{1/0} tends to zero exponentially, regardless of the error exponent, we can fix the threshold so that λ is arbitrarily small (yet positive).
As λ → 0, P* → P_0 and ε → D(P_0||P_1), which is the best achievable error exponent (for P_{0/1}) for the test.
[Figure: as the boundary of Λ_0 moves toward P_0, P* approaches P_0.]

Other links between IT and Statistics
Chernoff bound
Estimation theory: Cramér-Rao bound
...

Further readings
1. T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley.
2. I. Csiszár, "The method of types," IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2505-2523, Oct. 1998.
3. I. Csiszár and P. C. Shields, Information Theory and Statistics: A Tutorial, Foundations and Trends in Communications and Information Theory, NOW Publishers, 2004.