Summer School on Game Theory and Telecommunications Campione, 7-12 September, 2014 Information Theory and Hypothesis Testing Mauro Barni University of Siena
September 8: Review of some basic results linking Information Theory and Statistics (Hypothesis Testing)
September 9: Application of game theory and IT to adversarial signal processing: the Hypothesis Testing game
Outline
Review of basic concepts
  Hypothesis testing
  Information Theory
The method of types in Information Theory
Some classical applications
  Law of large numbers
  Sanov's theorem and large deviation theory
  Hypothesis testing
Hypothesis testing
Hypothesis testing: problem definition

H0: x^n ~ P0
H1: x^n ~ P1
Observe x^n, output a decision 0/1 in favor of H0 or H1

Goal: decide in favor of H0 or H1 according to a predefined optimality criterion
Decision rule: φ(x^n) = 0/1
Acceptance region: Λ0 = {x^n : φ(x^n) = 0}
Hypothesis testing: problem definition

Type I error probability (false alarm, false positive): P_{1/0} = P_f = Pr{φ(X^n) = 1 | H0} = P0(X^n ∈ Λ1)
Type II error probability (missed detection, false negative): P_{0/1} = P_m = Pr{φ(X^n) = 0 | H1} = P1(X^n ∈ Λ0)
A priori probabilities: P_{H0}, P_{H1}
Costs (losses, risks): L_{1/0}, L_{0/1}
Bayes criterion

Minimize the Bayes risk: choose φ(x^n) so that L = P_{H0} P_{1/0} L_{1/0} + P_{H1} P_{0/1} L_{0/1} is minimum

Optimum acceptance region:
Λ0 = { x^n : P0(x^n)/P1(x^n) ≥ (P_{H1} L_{0/1}) / (P_{H0} L_{1/0}) }
where P0(x^n)/P1(x^n) is the likelihood ratio
Neyman-Pearson criterion

Problems with the Bayes approach:
A priori probabilities are needed
Costs are difficult to define, especially for rare events

Neyman-Pearson approach: minimize P_m subject to a constraint on P_f
Λ0 = { x^n : P0(x^n)/P1(x^n) ≥ T }
where P0(x^n)/P1(x^n) is the likelihood ratio and the threshold T is determined by letting P_f = λ
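Both criteria reduce to comparing the likelihood ratio against a threshold; only the threshold changes. A minimal Python sketch of the test (the discrete pmfs `p0`, `p1`, the priors and the costs are illustrative assumptions, not values from the slides):

```python
import numpy as np

def likelihood_ratio_test(x, p0, p1, threshold):
    """Return 0 (accept H0) if prod_i p0(x_i)/p1(x_i) >= threshold, else 1."""
    lr = np.prod(p0[x] / p1[x])            # likelihood ratio P0(x^n)/P1(x^n)
    return 0 if lr >= threshold else 1

# Toy binary source: pmfs under the two hypotheses (illustrative values)
p0 = np.array([0.7, 0.3])                  # P0(x=0), P0(x=1)
p1 = np.array([0.4, 0.6])                  # P1(x=0), P1(x=1)
x = np.array([0, 1, 0, 0, 1, 0, 0, 0])     # observed sequence x^n

# Bayes threshold: (P_H1 * L_01) / (P_H0 * L_10), with assumed priors and costs
T_bayes = (0.5 * 1.0) / (0.5 * 1.0)
print(likelihood_ratio_test(x, p0, p1, T_bayes))
```

For the Neyman-Pearson test, the same function is used with `threshold` set so that the false-alarm probability equals the prescribed λ.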
An example: Gaussian observables (1/2)

P0 = N(µ0, σ²), P1 = N(µ1, σ²), with µ1 > µ0

L(x^n) = exp(-Σ_i (x_i - µ0)²/2σ²) / exp(-Σ_i (x_i - µ1)²/2σ²)

log L(x^n) = [ Σ_i (x_i - µ1)² - Σ_i (x_i - µ0)² ] / 2σ²

After some algebra, the test reduces to comparing the sample mean X̄ = (1/n) Σ_i x_i against a threshold τ (decide H1 if X̄ ≥ τ, since µ1 > µ0)

P(X̄ | H0) = N(µ0, σ²/n),  P(X̄ | H1) = N(µ1, σ²/n)

Determine τ so that the tail of N(µ0, σ²/n) beyond τ equals λ, i.e. P_f = λ
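A quick numerical sketch of this example (the values of µ0, µ1, σ, n and λ are made up for illustration; scipy's `norm.ppf` and `norm.cdf` provide the Gaussian quantile and distribution functions):

```python
from math import sqrt
from scipy.stats import norm

mu0, mu1, sigma, n = 0.0, 1.0, 2.0, 25      # illustrative parameters
lam = 1e-3                                   # target false-alarm probability P_f

s = sigma / sqrt(n)                          # std of the sample mean
tau = mu0 + s * norm.ppf(1 - lam)            # threshold: P0(Xbar > tau) = lam
Pm = norm.cdf(tau, loc=mu1, scale=s)         # missed detection: P1(Xbar < tau)
print(f"tau = {tau:.3f}, Pm = {Pm:.2e}")
```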
An example: Gaussian observables (2/2)

[Figure: the two densities of the sample mean, N(µ0, σ²/n) and N(µ1, σ²/n); the area of N(µ0, σ²/n) beyond the threshold τ is P_f, the area of N(µ1, σ²/n) below τ is P_m]
ROC curve

[Figure: ROC curve, P_m versus P_f on log-log axes, from 10^-13 to 1]
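A curve like this can be traced by sweeping the threshold τ; a short sketch continuing the Gaussian example above (same assumed parameters):

```python
import numpy as np
from scipy.stats import norm

mu0, mu1, s = 0.0, 1.0, 2.0 / np.sqrt(25)    # same illustrative parameters as above

taus = np.linspace(mu0 - 3 * s, mu1 + 3 * s, 50)
Pf = norm.sf(taus, loc=mu0, scale=s)          # false-alarm probability per threshold
Pm = norm.cdf(taus, loc=mu1, scale=s)         # missed-detection probability per threshold
for pf, pm in zip(Pf[::10], Pm[::10]):
    print(f"Pf = {pf:.2e}  Pm = {pm:.2e}")
```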
Information Theory
Measuring information: Shannon's approach

Model a source of information as a random variable
Information is related to ignorance and unpredictability
Information depends only on the probability of the events, not on their values
No attempt to model the importance of information in a given context
No link with the perceived level of information
Other important aspects are retained
Axiomatic definition of Entropy

Given a discrete source (random variable) X with alphabet X, look for a measure of information with the following properties:

H2(p, 1-p) is continuous in p
H2(1/2, 1/2) = 1 (normalization: the bit)
H_m(p1, p2, ..., pm) = H_{m-1}(p1+p2, p3, ..., pm) + (p1+p2) H2(p1/(p1+p2), p2/(p1+p2))  (grouping property)
H_m(p1, p2, ..., pm) = H_m(σ(p1, p2, ..., pm))  (independence from permutations)

With these axioms we necessarily have

Source Entropy = H(X) = -Σ_{i=1..|X|} p_i log2 p_i = -Σ_{x∈X} p(x) log2 p(x)
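A small sketch of the entropy function together with a numerical check of the grouping property (the example pmf is arbitrary):

```python
import numpy as np

def H(p):
    """Entropy in bits of a pmf given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # 0 log 0 = 0 by convention
    return -np.sum(p * np.log2(p))

p = np.array([0.5, 0.25, 0.125, 0.125])      # arbitrary pmf
lhs = H(p)
# grouping property: merge p1 and p2, then add the entropy of the split
rhs = H([p[0] + p[1], p[2], p[3]]) + (p[0] + p[1]) * H([p[0] / (p[0] + p[1]),
                                                        p[1] / (p[0] + p[1])])
print(lhs, rhs)                               # both equal 1.75 bits
```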
Source coding

The source coding theorem clarifies the meaning of the Entropy.

Source Coding Theorem (Shannon 1948): given a discrete memoryless source X, a lossless coding scheme exists if and only if the rate of the code is not lower than the entropy (R ≥ H).

The entropy captures the essence of information in the sense that it gives the minimum number of bits necessary to describe the output of a source.
Other quantities

We can define several other quantities capturing different aspects of the information measure when two or more sources are involved:

Joint entropy: H(X,Y) = -Σ_{x,y} p(x,y) log p(x,y)
Conditional entropy: H(X|Y) = -Σ_{x,y} p(x,y) log p(x|y)
Mutual information: I(X;Y) = Σ_{x,y} p(x,y) log [ p(x,y) / (p(x)p(y)) ]
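A sketch computing these quantities from a joint pmf given as a matrix (the joint pmf values are made up):

```python
import numpy as np

def joint_quantities(pxy):
    """Return H(X,Y), H(X|Y) and I(X;Y) in bits for a joint pmf matrix pxy[x, y]."""
    px = pxy.sum(axis=1)                      # marginal of X
    py = pxy.sum(axis=0)                      # marginal of Y
    nz = pxy > 0
    Hxy = -np.sum(pxy[nz] * np.log2(pxy[nz]))
    Hy = -np.sum(py[py > 0] * np.log2(py[py > 0]))
    Hx = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    Hx_given_y = Hxy - Hy                     # chain rule: H(X|Y) = H(X,Y) - H(Y)
    Ixy = Hx - Hx_given_y                     # I(X;Y) = H(X) - H(X|Y)
    return Hxy, Hx_given_y, Ixy

pxy = np.array([[0.3, 0.2],
                [0.1, 0.4]])                  # arbitrary joint pmf
print(joint_quantities(pxy))
```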
Relative entropy

In the rest of the talk we will make extensive use of the following quantity (also called divergence, or Kullback-Leibler distance):

D(P||Q) = Σ_{x∈X} P(x) log [ P(x)/Q(x) ]

The divergence can be interpreted as a kind of distance between pmfs:
D(P||Q) ≥ 0
D(P||Q) = 0 iff P = Q
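Since the divergence appears in nearly every bound that follows, here is a small helper reused in the later sketches (a sketch; the base-2 log matches the 2^{-nD} exponents used below):

```python
import numpy as np

def D(P, Q):
    """Kullback-Leibler divergence D(P||Q) in bits between two pmfs on the same alphabet."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0                              # terms with P(x) = 0 contribute 0
    return float(np.sum(P[mask] * np.log2(P[mask] / Q[mask])))

print(D([0.8, 0.2], [0.5, 0.5]))              # about 0.278 bits (used again in the coin example)
```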
The method of types
Type or empirical probability

Type, or empirical probability, of a sequence x^n:
P_{x^n}(a) = N(a|x^n) / n,  a ∈ X
where N(a|x^n) is the number of occurrences of a in x^n

P_n = set of all types with denominator n

Example: if X = {0,1},
P_5 = { (0,1), (1/5, 4/5), (2/5, 3/5), (3/5, 2/5), (4/5, 1/5), (1,0) }
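A sketch that computes the type of a sequence and enumerates P_n for a binary alphabet (sequence and alphabet chosen to match the slide):

```python
from collections import Counter
from fractions import Fraction

def type_of(x, alphabet):
    """Empirical probability (type) of the sequence x over the given alphabet."""
    n = len(x)
    counts = Counter(x)
    return tuple(Fraction(counts[a], n) for a in alphabet)

print(type_of("01100", "01"))                 # (3/5, 2/5)

# P_n for the binary alphabet: all pmfs of the form ((n-k)/n, k/n)
n = 5
P_n = [(Fraction(n - k, n), Fraction(k, n)) for k in range(n + 1)]
print(P_n)                                    # 6 types, consistent with |P_n| <= (n+1)^|X|
```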
Type class

Type class: the set of all sequences having the same type
T(P) = { x^n ∈ X^n : P_{x^n} = P }

Example: x^5 = 01100
P_{x^5} = (3/5, 2/5)
T(P_{x^5}) = { 11000, 10100, 10010, 10001, 01100, 01010, 01001, 00110, 00101, 00011 }
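For small n the type class can be enumerated by brute force; a sketch reusing `type_of` from the previous snippet:

```python
from itertools import product

def type_class(P, alphabet, n):
    """All length-n sequences over the alphabet whose type equals P (brute force)."""
    return ["".join(s) for s in product(alphabet, repeat=n)
            if type_of("".join(s), alphabet) == P]

P = type_of("01100", "01")
print(type_class(P, "01", 5))                 # the 10 sequences listed on the slide
print(len(type_class(P, "01", 5)))            # 10 = 5!/(3!2!)
```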
Number of types

The number of types grows polynomially with n

Theorem. The number of types with denominator n is upper bounded by |P_n| ≤ (n+1)^|X|

Proof. Obvious.
Probability of a sequence

Theorem. The probability that a sequence x = x^n is emitted by a DMS with pmf Q is
Q(x) = 2^{-n ( H(P_x) + D(P_x||Q) )}
If P_x = Q:
Q(x) = 2^{-nH(P_x)} = 2^{-nH(Q)}

Remember: the larger the KL distance between the type of x and Q, the lower the probability.
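A numerical check of this identity, reusing the `D` and `type_of` helpers from the sketches above (the pmf Q is arbitrary):

```python
import numpy as np

Q = {"0": 0.7, "1": 0.3}                      # arbitrary DMS pmf
x = "01100"
n = len(x)

prob_direct = np.prod([Q[c] for c in x])      # product of symbol probabilities

Px = [float(p) for p in type_of(x, "01")]     # type of x as floats
H_Px = -sum(p * np.log2(p) for p in Px if p > 0)
prob_types = 2 ** (-n * (H_Px + D(Px, [Q["0"], Q["1"]])))

print(prob_direct, prob_types)                # identical up to rounding
```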
Probability of a sequence

Proof.
Q(x) = Π_i Q(x_i) = Π_{a∈X} Q(a)^{N(a|x)} = Π_{a∈X} Q(a)^{n P_x(a)} = Π_{a∈X} 2^{n P_x(a) log Q(a)}
= 2^{n Σ_a [ P_x(a) log Q(a) - P_x(a) log P_x(a) + P_x(a) log P_x(a) ]}
= 2^{-n Σ_a [ P_x(a) log (P_x(a)/Q(a)) - P_x(a) log P_x(a) ]}
= 2^{-n [ H(P_x) + D(P_x||Q) ]}
Size of a type class

Theorem. The size of a type class T(P) can be bounded as follows:
(1/(n+1)^|X|) 2^{nH(P)} ≤ |T(P)| ≤ 2^{nH(P)}

Remember: the size of a type class grows exponentially, with growth rate equal to the entropy of the type.
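A quick check of these bounds against the exact multinomial count (alphabet size 2, arbitrary type):

```python
from math import comb, log2

n, k = 20, 6                                  # type P = (k/n, (n-k)/n), arbitrary
P = (k / n, (n - k) / n)
H_P = -sum(p * log2(p) for p in P if p > 0)

size = comb(n, k)                             # |T(P)| = n!/(k!(n-k)!)
lower = 2 ** (n * H_P) / (n + 1) ** 2         # (n+1)^{-|X|} 2^{nH(P)}, with |X| = 2
upper = 2 ** (n * H_P)

print(lower <= size <= upper)                 # True
print(lower, size, upper)
```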
Size of a type class

Proof (upper bound). Given P ∈ P_n, consider the probability that a source with pmf P emits a sequence in T(P). We have
1 ≥ Σ_{x∈T(P)} P(x) = Σ_{x∈T(P)} 2^{-nH(P)} = |T(P)| 2^{-nH(P)}
hence |T(P)| ≤ 2^{nH(P)}
Size of a type class

Proof (lower bound).
|T(P)| = ( n choose nP(a_1), ..., nP(a_|X|) ) = n! / (n_1! n_2! ... n_|X|!)
Using the Stirling approximation n! ≈ (n/e)^n:
|T(P)| ≈ (n/e)^n / [ (n_1/e)^{n_1} ... (n_|X|/e)^{n_|X|} ]
and after some algebra
|T(P)| ≥ (1/(n+1)^|X|) 2^{nH(P)}
Probability of a type class

Theorem. The probability that a DMS with pmf Q emits a sequence belonging to T(P) can be bounded as follows:
(1/(n+1)^|X|) 2^{-nD(P||Q)} ≤ Q(T(P)) ≤ 2^{-nD(P||Q)}

Remember: the larger the KL distance between P and Q, the smaller the probability. If P = Q the exponent is zero: as the law of large numbers below shows, essentially all of the probability concentrates on types close to Q.
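A numerical check of these bounds for a binary source, reusing the `D` helper from above (all values are arbitrary):

```python
from math import comb

n, k = 20, 6
P = [k / n, (n - k) / n]                      # type under consideration
Q = [0.5, 0.5]                                # source pmf, arbitrary

# exact probability that Q emits a sequence of type P (binomial count)
QTP = comb(n, k) * Q[0] ** k * Q[1] ** (n - k)

d = D(P, Q)
lower = 2 ** (-n * d) / (n + 1) ** 2          # (n+1)^{-|X|} 2^{-nD(P||Q)}
upper = 2 ** (-n * d)

print(lower <= QTP <= upper)                  # True
```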
Probability of a type class

Proof.
Q(T(P)) = Σ_{x∈T(P)} Q(x) = Σ_{x∈T(P)} 2^{-n(H(P)+D(P||Q))} = |T(P)| 2^{-n(H(P)+D(P||Q))}
By remembering the bounds on the size of T(P):
(1/(n+1)^|X|) 2^{-nD(P||Q)} ≤ Q(T(P)) ≤ 2^{-nD(P||Q)}
In summary

|P_n| ≤ (n+1)^|X|
Q(x) = 2^{-n [ D(P_x||Q) + H(P_x) ]}
|T(P)| ≈ 2^{nH(P)}
Q(T(P)) ≈ 2^{-nD(P||Q)}
Information Theory and Statistics
Law of large numbers

Q(T(P)) ≈ 2^{-nD(P||Q)}: when n grows, the only type classes with a non-negligible probability are those close (in divergence) to Q

Theorem (law of large numbers). Let T_Q^ε = { x^n : D(P_{x^n}||Q) ≤ ε }. Then
P(x^n ∉ T_Q^ε) = Σ_{P: D(P||Q)>ε} Q(T(P)) ≤ Σ_{P: D(P||Q)>ε} 2^{-nD(P||Q)} ≤ Σ_{P: D(P||Q)>ε} 2^{-nε} ≤ (n+1)^|X| 2^{-nε} = 2^{-n ( ε - |X| log(n+1)/n )}
which tends to 0 when n tends to infinity.
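For a binary source this probability can be computed exactly by summing over the types outside the ε-ball; a short sketch (ε and Q chosen arbitrarily), reusing `D` from above, showing the decay with n:

```python
from math import comb

Q = [0.5, 0.5]
eps = 0.05

for n in (50, 200, 800):
    p_out = sum(comb(n, k) * Q[0] ** k * Q[1] ** (n - k)
                for k in range(n + 1)
                if D([k / n, 1 - k / n], Q) > eps)
    print(n, p_out)                           # probability of falling outside T_Q^eps shrinks with n
```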
Large deviation theory LDT studies the probability of rare events, i.e. events not covered by the law of large numbers Examples What is the probability that in 1000 fair coin tosses head appears 800 times? Compute the probability that the mean value of a sequence (emitted by a DMS X) is larger than T, with T much larger than E[X]. Rare events in statistical physics or economics
Large deviation theory

More formally: let S be a subset of pmfs and let Q be a source. We want to compute the probability that Q emits a sequence whose type belongs to S:
Q(S) = Σ_{x: P_x ∈ S} Q(x)

Example: what is the probability that the average value of a sequence drawn from Q is larger than 4? This is the above problem with S = { pmfs P such that E_P[X] > 4 }.
Large deviation theory

If S contains a KL neighborhood of Q, then Q(S) -> 1
If S does not contain Q or a KL neighborhood of Q, then Q(S) -> 0. The question is: how fast?

[Figure: the two cases, with Q inside S and with Q outside S]
Sanov's theorem

Theorem (Sanov). Let S be a regular set of pmfs (cl(int(S)) = S). Then
Q(S) ≈ 2^{-nD(P*||Q)}
where P* = argmin_{P∈S} D(P||Q)

[Figure: the set S, the source pmf Q outside it, and the point P* of S closest to Q in divergence]
Sanov's theorem

Proof (upper bound).
Q(S) = Σ_{P∈S∩P_n} Q(T(P)) ≤ Σ_{P∈S∩P_n} 2^{-nD(P||Q)} ≤ Σ_{P∈S∩P_n} 2^{-n min_{P∈S∩P_n} D(P||Q)}
≤ Σ_{P∈S∩P_n} 2^{-n min_{P∈S} D(P||Q)} = Σ_{P∈S∩P_n} 2^{-nD(P*||Q)} ≤ (n+1)^|X| 2^{-nD(P*||Q)}
Sanov's theorem

Proof (lower bound). Due to the regularity of S and the density of ∪_n P_n in the set of all pmfs, we can find a sequence P_n ∈ S ∩ P_n such that P_n -> P* and hence D(P_n||Q) -> D(P*||Q). Then for large n we can write:
Q(S) = Σ_{P∈S∩P_n} Q(T(P)) ≥ Q(T(P_n)) ≥ (1/(n+1)^|X|) 2^{-nD(P_n||Q)} ≈ (1/(n+1)^|X|) 2^{-nD(P*||Q)}
Example

Compute the probability that in 1000 coin tosses, head shows more than 800 times.
S = { B(p, 1-p) with p ≥ 0.8 }
Q = B(0.5, 0.5)
P* = B(0.8, 0.2)
D(P*||Q) = 1 - H(P*) = 1 - h(0.8) ≈ 0.28
P(S) ≈ 2^{-nD(P*||Q)} ≈ 2^{-280} !!!!
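The Sanov estimate can be compared with the exact binomial tail; a short self-contained check (exact arithmetic with Python integers keeps the tiny numbers meaningful):

```python
from math import comb, log2

n = 1000
exact = sum(comb(n, k) for k in range(801, n + 1)) * 0.5 ** n   # P(heads > 800), fair coin

h08 = -(0.8 * log2(0.8) + 0.2 * log2(0.2))                       # binary entropy h(0.8)
sanov = 2 ** (-n * (1 - h08))                                    # 2^{-n D(P*||Q)}, D = 1 - h(0.8)

print(f"exact = 2^{log2(exact):.1f}")     # about 2^-284
print(f"Sanov = 2^{log2(sanov):.1f}")     # about 2^-278: same first-order exponent
```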
Hypothesis testing

Consider two hypotheses H0 and H1 and a sequence of observations x^n:
H0: X ~ P0
H1: X ~ P1

Neyman-Pearson criterion: minimize P_{0/1} for a fixed P_{1/0}
Decide for H0 if P0(x^n)/P1(x^n) ≥ T
where P0(x^n)/P1(x^n) is the likelihood ratio and T depends on P_{1/0}
Hypothesis testing

Let us pass to the log-likelihood ratio and assume a DMS:
log [ P0(x^n)/P1(x^n) ] = Σ_i log [ P0(x_i)/P1(x_i) ] = Σ_{a∈X} n P_x(a) log [ P0(a)/P1(a) ]
= Σ_{a∈X} n P_x(a) log [ P_x(a)/P1(a) ] - Σ_{a∈X} n P_x(a) log [ P_x(a)/P0(a) ]
= n [ D(P_x||P1) - D(P_x||P0) ]

The N-P criterion boils down to
D(P_x||P1) - D(P_x||P0) ≷ (log T)/n = τ
(decide H0 if the left-hand side is ≥ τ)
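A sketch verifying that the log-likelihood ratio equals n[D(P_x||P1) - D(P_x||P0)], reusing the `D` and `type_of` helpers from above (pmfs and sequence are arbitrary):

```python
import numpy as np

P0 = {"0": 0.7, "1": 0.3}
P1 = {"0": 0.4, "1": 0.6}
x = "0100010"
n = len(x)

llr = sum(np.log2(P0[c] / P1[c]) for c in x)         # log P0(x^n)/P1(x^n), in bits

Px = [float(p) for p in type_of(x, "01")]
kl_form = n * (D(Px, [P1["0"], P1["1"]]) - D(Px, [P0["0"], P0["1"]]))

print(llr, kl_form)                                   # identical up to rounding
```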
Error exponents in HT and Sanov's theorem

Error exponents:
λ = -lim_{n→∞} (1/n) log P_{1/0}
ε = -lim_{n→∞} (1/n) log P_{0/1}

In terms of types, the acceptance region is Λ0 = { P : D(P||P1) - D(P||P0) ≥ τ }

By Sanov's theorem:
λ = D(P0*||P0), with P0* = argmin_{P∈Λ1} D(P||P0)  (Λ1 = complement of Λ0)
ε = D(P1*||P1), with P1* = argmin_{P∈Λ0} D(P||P1)

It can be proven that P0* = P1* = P*: both minima are attained at the same pmf on the boundary of Λ0, so
λ = D(P*||P0), ε = D(P*||P1)

[Figure: the space of pmfs with P0, P1, the region Λ0 and the common minimizer P* on its boundary]
Best achievable error exponent: Stein's lemma

If it is enough that P_{1/0} tends to zero exponentially, regardless of the error exponent, we can fix the threshold so that λ is arbitrarily small (yet positive).

As λ -> 0, P* -> P0 and ε -> D(P0||P1), which is the best achievable error exponent (for P_{0/1}) for the test.

[Figure: as the boundary of Λ0 moves towards P0, the common minimizer P* approaches P0]
Other links between IT and Statistics
Chernoff bound
Estimation theory: Cramér-Rao bound
...
Further readings
1. T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley.
2. I. Csiszár, "The method of types," IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2505-2523, Oct. 1998.
3. I. Csiszár and P. C. Shields, "Information Theory and Statistics: A Tutorial," Foundations and Trends in Communications and Information Theory, NOW Publishers, 2004.