Chapter 8. Hypothesis Testing. Po-Ning Chen. Department of Communications Engineering. National Chiao-Tung University. Hsin Chu, Taiwan 30050


Error exponent and divergence II:8-1

Definition 8.1 (exponent) A real number $a$ is said to be the exponent for a sequence of non-negative quantities $\{a_n\}_{n \ge 1}$ converging to zero, if
$$a = \lim_{n\to\infty} -\frac{1}{n}\log a_n.$$
In operation, the exponent is an index of the exponential rate of convergence of the sequence $a_n$: for any $\gamma > 0$,
$$e^{-n(a+\gamma)} \le a_n \le e^{-n(a-\gamma)}$$
for $n$ large enough.

Recall that in proving the channel coding theorem, the probability of decoding error for channel block codes can be made arbitrarily close to zero when the rate of the codes is less than the channel capacity. This result can be written mathematically as: $P_e(C_n) \to 0$ as $n \to \infty$, provided $R = \limsup_{n\to\infty} (1/n)\log|C_n| < C$, where $C_n$ is the optimal code for block length $n$. From the theorem, we only know that the decoding error vanishes as the block length increases; it does not reveal how fast the decoding error approaches zero.
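As a quick numerical illustration of Definition 8.1 (a sketch of mine, not part of the notes; the function names are my own): for $a_n = n\,e^{-2n}$ the polynomial factor $n$ is subexponential, so the finite-$n$ estimate $-(1/n)\log a_n$ converges to the exponent 2.

```python
import math

# a_n = n * e^{-2n}; work with log a_n directly to avoid floating-point underflow.
def log_a(n):
    return math.log(n) - 2.0 * n

def empirical_exponent(n):
    """-(1/n) log a_n, the finite-n estimate of the exponent of a_n."""
    return -log_a(n) / n

# The estimate approaches 2; the (log n)/n correction vanishes only slowly.
for n in (10, 100, 1000):
    print(n, empirical_exponent(n))
```

The slow $O((\log n)/n)$ convergence of the estimate is exactly the subexponential effect the definition deliberately ignores.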

Error exponent and divergence II:8-2

In other words, we do not know the rate of convergence of the decoding error. Sometimes this information is very important, especially when one must decide what block length suffices to achieve a given error bound.

The first step in investigating the rate of convergence of the decoding error is to compute its exponent, provided the decoding error decays to zero exponentially fast (it indeed does for memoryless channels). This exponent, as a function of the rate, is called the channel reliability function, and will be discussed in the next chapter.

For hypothesis testing problems, the type II error probability at a fixed test level also decays to zero as the number of observations increases. As it turns out, its exponent is the divergence of the null hypothesis distribution against the alternative hypothesis distribution.

Stein's lemma II:8-3

Lemma 8.2 (Stein's lemma) For a sequence of i.i.d. observations $X^n$, possibly drawn from either the null hypothesis distribution $P_{X^n}$ or the alternative hypothesis distribution $P_{\hat X^n}$, the type II error satisfies
$$(\forall\, \varepsilon \in (0,1))\quad \lim_{n\to\infty} -\frac{1}{n}\log \beta_n^*(\varepsilon) = D(P_X \| P_{\hat X}),$$
where $\beta_n^*(\varepsilon) = \min_{\alpha_n \le \varepsilon} \beta_n$, and $\alpha_n$ and $\beta_n$ represent the type I and type II errors, respectively.

Proof: [1. Forward Part] In the forward part, we prove that there exists an acceptance region for the null hypothesis such that
$$\liminf_{n\to\infty} -\frac{1}{n}\log \beta_n^*(\varepsilon) \ge D(P_X \| P_{\hat X}).$$

Step 1: divergence typical set. For any $\delta > 0$, define the divergence typical set as
$$A_n(\delta) = \left\{ x^n : \left| \frac{1}{n}\log\frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)} - D(P_X \| P_{\hat X}) \right| < \delta \right\}.$$
Note that on the divergence typical set,
$$P_{\hat X^n}(x^n) \le P_{X^n}(x^n)\, e^{-n(D(P_X \| P_{\hat X}) - \delta)}.$$

Stein's lemma II:8-4

Step 2: computation of the type I error. By the weak law of large numbers, $P_{X^n}(A_n(\delta)) \to 1$. Hence,
$$\alpha_n = P_{X^n}(A_n^c(\delta)) < \varepsilon$$
for sufficiently large $n$.

Step 3: computation of the type II error.
$$\beta_n(\varepsilon) = P_{\hat X^n}(A_n(\delta)) = \sum_{x^n \in A_n(\delta)} P_{\hat X^n}(x^n) \le \sum_{x^n \in A_n(\delta)} P_{X^n}(x^n)\, e^{-n(D(P_X \| P_{\hat X}) - \delta)} = e^{-n(D(P_X \| P_{\hat X}) - \delta)}\,(1 - \alpha_n).$$
Hence,
$$\frac{1}{n}\log \beta_n(\varepsilon) \le -\big(D(P_X \| P_{\hat X}) - \delta\big) + \frac{1}{n}\log(1 - \alpha_n),$$

Stein's lemma II:8-5

which implies
$$\liminf_{n\to\infty} -\frac{1}{n}\log \beta_n(\varepsilon) \ge D(P_X \| P_{\hat X}) - \delta.$$
The above inequality is true for any $\delta > 0$. Therefore,
$$\liminf_{n\to\infty} -\frac{1}{n}\log \beta_n(\varepsilon) \ge D(P_X \| P_{\hat X}).$$

[2. Converse Part] In the converse part, we will prove that for any acceptance region $B_n$ for the null hypothesis satisfying the type I error constraint, i.e.,
$$\alpha_n(B_n) = P_{X^n}(B_n^c) \le \varepsilon,$$
its type II error $\beta_n(B_n)$ satisfies
$$\limsup_{n\to\infty} -\frac{1}{n}\log \beta_n(B_n) \le D(P_X \| P_{\hat X}).$$

Stein's lemma II:8-6

Hence,
$$\begin{aligned}
\beta_n(B_n) = P_{\hat X^n}(B_n) &\ge P_{\hat X^n}(B_n \cap A_n(\delta)) = \sum_{x^n \in B_n \cap A_n(\delta)} P_{\hat X^n}(x^n) \\
&\ge \sum_{x^n \in B_n \cap A_n(\delta)} P_{X^n}(x^n)\, e^{-n(D(P_X \| P_{\hat X}) + \delta)} \\
&= e^{-n(D(P_X \| P_{\hat X}) + \delta)}\, P_{X^n}(B_n \cap A_n(\delta)) \\
&\ge e^{-n(D(P_X \| P_{\hat X}) + \delta)} \big(1 - P_{X^n}(B_n^c) - P_{X^n}(A_n^c(\delta))\big) \\
&= e^{-n(D(P_X \| P_{\hat X}) + \delta)} \big(1 - \alpha_n(B_n) - P_{X^n}(A_n^c(\delta))\big) \\
&\ge e^{-n(D(P_X \| P_{\hat X}) + \delta)} \big(1 - \varepsilon - P_{X^n}(A_n^c(\delta))\big).
\end{aligned}$$
So
$$-\frac{1}{n}\log \beta_n(B_n) \le D(P_X \| P_{\hat X}) + \delta - \frac{1}{n}\log\big(1 - \varepsilon - P_{X^n}(A_n^c(\delta))\big),$$
which implies that
$$\limsup_{n\to\infty} -\frac{1}{n}\log \beta_n(B_n) \le D(P_X \| P_{\hat X}) + \delta.$$
The above inequality is true for any $\delta > 0$. Therefore,
$$\limsup_{n\to\infty} -\frac{1}{n}\log \beta_n(B_n) \le D(P_X \| P_{\hat X}).$$
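Lemma 8.2 can be checked numerically. The following sketch (my own illustration, not from the notes) uses Bernoulli hypotheses with $p > q$, for which the count of ones is a sufficient statistic and the likelihood ratio is increasing in that count, so the optimal Neyman-Pearson acceptance region is a threshold region $\{k \ge k_0\}$. The helper names (`stein_rate`, etc.) are mine.

```python
import math

def log_binom_pmf(n, k, p):
    """log of the Binomial(n, p) pmf at k."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def stein_rate(n, p, q, eps):
    """-(1/n) log beta_n*(eps) for null Bern(p) vs alternative Bern(q), p > q.

    Picks the largest threshold k0 keeping the type I error alpha_n <= eps,
    then evaluates the type II error beta_n = Q{K >= k0} in log space.
    """
    cdf = 0.0
    k0 = 0
    for k in range(n + 1):
        pk = math.exp(log_binom_pmf(n, k, p))
        if cdf + pk > eps:
            break
        cdf += pk          # alpha_n = P{K < k0} stays <= eps
        k0 = k + 1
    log_beta = logsumexp([log_binom_pmf(n, k, q) for k in range(k0, n + 1)])
    return -log_beta / n

p, q, eps = 0.5, 0.1, 0.5
D = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
print("D =", D, " rate at n=500 =", stein_rate(500, p, q, eps))
```

At $n = 500$ the empirical exponent already sits within about 0.01 of $D(P_X \| P_{\hat X})$, and, as the lemma asserts, the limit does not depend on $\varepsilon$.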

Composition of sequence of i.i.d. observations II:8-7

Stein's lemma gives the exponent of the type II error probability at a fixed test level. This exponent, which is the divergence of the null hypothesis distribution against the alternative hypothesis distribution, is independent of the type I error bound $\varepsilon$ for i.i.d. observations.

Specifically, in the i.i.d. setting, the probability of each sequence $x^n$ depends only on its composition, which is defined as the $|\mathcal{X}|$-dimensional vector
$$\left( \frac{\#1(x^n)}{n},\ \frac{\#2(x^n)}{n},\ \ldots,\ \frac{\#k(x^n)}{n} \right),$$
where $\mathcal{X} = \{1, 2, \ldots, k\}$ and $\#i(x^n)$ is the number of occurrences of symbol $i$ in $x^n$. The probability of $x^n$ can therefore be written as
$$P_{X^n}(x^n) = P_X(1)^{\#1(x^n)}\, P_X(2)^{\#2(x^n)} \cdots P_X(k)^{\#k(x^n)}.$$
Note that $\#1(x^n) + \cdots + \#k(x^n) = n$.

Since the composition of a sequence determines its probability, all sequences with the same composition have the same statistical properties, and hence should be treated the same when processed.

Composition of sequence of i.i.d. observations II:8-8

Instead of manipulating the sequences of observations through typical-set-like concepts, we may focus on their compositions. As it turns out, this approach yields simpler proofs and better geometric interpretations of the theory in the i.i.d. setting. (It should be pointed out that in cases where the composition alone cannot determine the probability, this viewpoint does not seem to be effective.)

Lemma 8.3 (polynomial bound on the number of compositions) The number of compositions increases polynomially fast in $n$, while the number of possible sequences increases exponentially fast.

Proof: Let $\mathcal{P}_n$ denote the set of all possible compositions. Since each of the $|\mathcal{X}|$ counts $\#i(x^n)$ can take at most $n+1$ values,
$$|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}.$$

Composition of sequence of i.i.d. observations II:8-9

Lemma 8.4 (probability of sequences of the same composition) The probability of the sequences of composition $C$ with respect to the distribution $P_{X^n}$ satisfies
$$\frac{1}{(n+1)^{|\mathcal{X}|}}\, e^{-n D(P_C \| P_X)} \le P_{X^n}(C) \le e^{-n D(P_C \| P_X)},$$
where $P_C$ is the composition distribution for composition $C$, and $C$ (by abusing notation without ambiguity) is also used to represent the set of all sequences (in $\mathcal{X}^n$) of composition $C$.

Theorem 8.5 (Sanov's theorem) Let $E_n$ be the set that consists of all compositions over the finite alphabet $\mathcal{X}$ whose composition distribution belongs to $\mathcal{P}$. Fix a sequence of product distributions $P_{X^n} = \prod_{i=1}^n P_X$. Then
$$\liminf_{n\to\infty} -\frac{1}{n}\log P_{X^n}(E_n) \ge \inf_{P \in \mathcal{P}} D(P \| P_X).$$
If, in addition, for every distribution $P$ in $\mathcal{P}$ there exists a sequence of composition distributions $P_{C_1}, P_{C_2}, P_{C_3}, \ldots \in \mathcal{P}$ such that $\limsup_{n\to\infty} D(P_{C_n} \| P_X) = D(P \| P_X)$, then
$$\limsup_{n\to\infty} -\frac{1}{n}\log P_{X^n}(E_n) \le \inf_{P \in \mathcal{P}} D(P \| P_X).$$
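Lemmas 8.3 and 8.4 can be verified exhaustively for a small binary example (a sketch of mine: for $\mathcal{X} = \{0,1\}$ a composition is determined by the count $k$ of ones, so $P_{X^n}(C) = \binom{n}{k} p^k (1-p)^{n-k}$; the variable names are my own).

```python
import math
from math import comb, log, exp

# Binary alphabet X = {0, 1}: a composition is determined by k = #1(x^n).
n, p = 12, 0.3

def D(r, p):
    """Divergence D(Bern(r) || Bern(p)) in nats, with the convention 0 log 0 = 0."""
    d = 0.0
    if r > 0:
        d += r * log(r / p)
    if r < 1:
        d += (1 - r) * log((1 - r) / (1 - p))
    return d

num_compositions = n + 1                     # k = 0, 1, ..., n

results = []
for k in range(n + 1):
    P_C = comb(n, k) * p**k * (1 - p)**(n - k)   # P_{X^n}(C), whole composition class
    d = D(k / n, p)
    lower = exp(-n * d) / (n + 1) ** 2           # Lemma 8.4 lower bound, |X| = 2
    upper = exp(-n * d)                          # Lemma 8.4 upper bound
    # Tiny multiplicative slack guards against floating-point ties at k = 0, n.
    results.append(lower * (1 - 1e-9) <= P_C <= upper * (1 + 1e-9))
all_ok = all(results)
print("Lemma 8.3:", num_compositions, "<=", (n + 1) ** 2,
      "| Lemma 8.4 holds for every composition:", all_ok)
```

Note that at $k = 0$ and $k = n$ the upper bound of Lemma 8.4 is met with equality, since the class then contains a single sequence.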

Geometrical interpretation for Sanov's theorem II:8-10

[Figure: the geometric meaning of Sanov's theorem. Over the probability space, the exponent $\inf_{P \in \mathcal{P}} D(P \| P_X)$ is attained by the distribution $P_{\min} \in \mathcal{P}$ (among $P_1, P_2, \ldots$) that is closest to $P_X$ in divergence.]

Geometrical interpretation for Sanov's theorem II:8-11

Example 8.6 Question: One wants to roughly estimate the probability that the average of the throws is greater than or equal to 4 when tossing a fair die $n$ times. Observe that whether the requirement is satisfied depends only on the composition of the observations. Let $E_n$ be the set of compositions which satisfy the requirement:
$$E_n = \left\{ C : \sum_{i=1}^6 i\, P_C(i) \ge 4 \right\}.$$
To minimize $D(P_C \| P_X)$ for $C \in E_n$, we can use the Lagrange multiplier technique (since the divergence is convex in its first argument), with the constraints on $P_C$ being
$$\sum_{i=1}^6 i\, P_C(i) = k \quad\text{and}\quad \sum_{i=1}^6 P_C(i) = 1, \quad\text{for each admissible } k \ge 4.$$
So it becomes the minimization of
$$\sum_{i=1}^6 P_C(i) \log\frac{P_C(i)}{P_X(i)} + \lambda_1\left( \sum_{i=1}^6 i\, P_C(i) - k \right) + \lambda_2\left( \sum_{i=1}^6 P_C(i) - 1 \right).$$

Geometrical interpretation for Sanov's theorem II:8-12

Taking derivatives, we find that the minimizer should be of the form
$$P_C(i) = \frac{e^{\lambda_1 i}}{\sum_{j=1}^6 e^{\lambda_1 j}},$$
where $\lambda_1$ is chosen to satisfy
$$\sum_{i=1}^6 i\, P_C(i) = k. \tag{8.1.1}$$
Since the above holds for all $k \ge 4$, it suffices to take the smallest one as our solution, i.e., $k = 4$. Finally, solving (8.1.1) numerically for $k = 4$, the minimizer is
$$P_C = (0.1031,\ 0.1227,\ 0.1461,\ 0.1740,\ 0.2072,\ 0.2468),$$
and the exponent of the desired probability is $D(P_C \| P_X) = 0.0433$ nat. Consequently,
$$P_{X^n}(E_n) \approx e^{-0.0433\, n}.$$
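The numbers in Example 8.6 can be reproduced with a few lines of bisection on the tilt parameter (a sketch of mine; `tilted` and `mean` are hypothetical helper names):

```python
import math

# Tilted family P(i) proportional to e^{lam * i}, i = 1..6; choose lam by
# bisection so that the mean of the tilted die equals 4 (constraint (8.1.1)).
def tilted(lam):
    w = [math.exp(lam * i) for i in range(1, 7)]
    s = sum(w)
    return [x / s for x in w]

def mean(P):
    return sum(i * P[i - 1] for i in range(1, 7))

lo, hi = 0.0, 1.0       # mean is 3.5 at lam = 0 and above 4 at lam = 1
for _ in range(100):
    mid = (lo + hi) / 2
    if mean(tilted(mid)) < 4.0:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2
P = tilted(lam)
D = sum(p * math.log(6 * p) for p in P)   # D(P || uniform die) in nats
print([round(p, 4) for p in P], round(D, 4))
```

This recovers the minimizer $(0.1031, \ldots, 0.2468)$ and the exponent $\approx 0.0433$ nat quoted in the example.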

Divergence typical set on composition II:8-13

Divergence typical set in Stein's lemma:
$$A_n(\delta) = \left\{ x^n : \left| \frac{1}{n}\log\frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)} - D(P_X \| P_{\hat X}) \right| < \delta \right\}.$$
Its composition-based analogue is
$$T_n(\delta) = \{ x^n \in \mathcal{X}^n : D(P_{C_{x^n}} \| P_X) \le \delta \},$$
where $C_{x^n}$ represents the composition of $x^n$. The fact that $P_{X^n}(T_n(\delta)) \to 1$ is justified by
$$1 - P_{X^n}(T_n(\delta)) = \sum_{\{C : D(P_C \| P_X) > \delta\}} P_{X^n}(C) \le \sum_{\{C : D(P_C \| P_X) > \delta\}} e^{-n D(P_C \| P_X)} \quad\text{(from Lemma 8.4)}$$
$$\le \sum_{\{C : D(P_C \| P_X) > \delta\}} e^{-n\delta} \le (n+1)^{|\mathcal{X}|}\, e^{-n\delta} \quad\text{(cf. Lemma 8.3)}.$$

Universal source coding on composition II:8-14

Universal code for i.i.d. sources: $f_n : \mathcal{X}^n \to \{0,1\}^*$ such that
$$\frac{1}{n} \sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n)\, \ell(f_n(x^n)) \to H(X)$$
as $n$ goes to infinity.

Example 8.7 (universal encoding using compositions) Binary-index the compositions using $\lceil \log_2 (n+1)^{|\mathcal{X}|} \rceil$ bits, and denote this binary index for composition $C$ by $a(C)$. Let $C_{x^n}$ denote the composition of $x^n$, i.e., $x^n \in C_{x^n}$. Binary-index the elements in $C$ using $\lceil n\, H(P_C) \rceil$ bits, and denote this binary index for the elements in $C$ by $b(C_{x^n})$. For each composition $C$, we know that the number of sequences $x^n$ in $C$ is at most $2^{n H(P_C)}$. (Here, $H(P_C)$ is measured in bits, i.e., the logarithmic base in the entropy is 2; see the proof of Lemma 8.4.)

Universal source coding on composition II:8-15

Define a universal encoding function $f_n$ as
$$f_n(x^n) = \text{concatenation}\{a(C_{x^n}),\, b(C_{x^n})\}.$$
Then this encoding rule is a universal code for all i.i.d. sources.

Universal source coding on composition II:8-16

Proof:
$$\bar\ell_n = \sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n)\, \ell(a(C_{x^n})) + \sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n)\, \ell(b(C_{x^n}))$$
$$\le \sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n)\, \log_2 (n+1)^{|\mathcal{X}|} + \sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n)\, n\, H(P_{C_{x^n}})$$
$$= |\mathcal{X}| \log_2 (n+1) + \sum_{\{C\}} P_{X^n}(C)\, n\, H(P_C).$$
Hence,
$$\frac{1}{n}\bar\ell_n \le \frac{|\mathcal{X}| \log_2 (n+1)}{n} + \sum_{\{C\}} P_{X^n}(C)\, H(P_C).$$

Universal source coding on composition II:8-17

$$\begin{aligned}
\sum_{\{C\}} P_{X^n}(C)\, H(P_C)
&= \sum_{\{C \in T_n(\delta)\}} P_{X^n}(C)\, H(P_C) + \sum_{\{C \notin T_n(\delta)\}} P_{X^n}(C)\, H(P_C) \\
&\le \max_{\{C : D(P_C \| P_X) \le \delta/\log 2\}} H(P_C) + \sum_{\{C : D(P_C \| P_X) > \delta/\log 2\}} P_{X^n}(C)\, H(P_C) \\
&\le \max_{\{C : D(P_C \| P_X) \le \delta/\log 2\}} H(P_C) + \sum_{\{C : D(P_C \| P_X) > \delta/\log 2\}} 2^{-n D(P_C \| P_X)}\, H(P_C) \quad\text{(from Lemma 8.4)} \\
&\le \max_{\{C : D(P_C \| P_X) \le \delta/\log 2\}} H(P_C) + \sum_{\{C : D(P_C \| P_X) > \delta/\log 2\}} e^{-n\delta}\, H(P_C) \\
&\le \max_{\{C : D(P_C \| P_X) \le \delta/\log 2\}} H(P_C) + (n+1)^{|\mathcal{X}|}\, e^{-n\delta} \log_2 |\mathcal{X}|,
\end{aligned}$$
where the second term of the last step vanishes as $n \to \infty$. (Note that when the base-2 logarithm is taken in the divergence instead of the natural logarithm, the range $[0, \delta]$ in

Universal source coding on composition II:8-18

$T_n(\delta)$ should be replaced by $[0, \delta/\log 2]$.)

It remains to show that
$$\max_{\{C : D(P_C \| P_X) \le \delta/\log 2\}} H(P_C) \le H(X) + \gamma(\delta),$$
where $\gamma(\delta)$ depends only on $\delta$ and approaches zero as $\delta \to 0$. ...
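The rate bound from the proof above, $(1/n)\bar\ell_n \le |\mathcal{X}|\log_2(n+1)/n + \sum_C P_{X^n}(C) H(P_C)$, can be evaluated exactly for a Bernoulli source (a sketch of mine; as in the slide, the integer-rounding of the code lengths is neglected, and the variable names are my own):

```python
import math
from math import log2

# Average rate of the composition-based universal code for a Bernoulli(p) source.
n, p = 2000, 0.3
H = -(p * log2(p) + (1 - p) * log2(1 - p))        # source entropy, bits/symbol

def log2_pmf(k):
    """log2 P_{X^n}(C) for the composition with k ones."""
    return ((math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1))
            / math.log(2) + k * log2(p) + (n - k) * log2(1 - p))

def H_C(k):
    """Entropy of the composition distribution, in bits."""
    r = k / n
    if r in (0.0, 1.0):
        return 0.0
    return -(r * log2(r) + (1 - r) * log2(1 - r))

avg_payload = sum(2.0 ** log2_pmf(k) * H_C(k) for k in range(n + 1))
rate = 2 * log2(n + 1) / n + avg_payload          # |X| = 2 header bits + payload
print(f"H = {H:.4f} bits, universal rate bound = {rate:.4f} bits/symbol")
```

At $n = 2000$ the universal rate exceeds the entropy only by the $O((\log n)/n)$ header cost of describing the composition, which is the whole point of the construction: no knowledge of $p$ was used by the encoder.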

Likelihood ratio versus divergence II:8-19

Recall that the Neyman-Pearson lemma indicates that the optimal test between two hypotheses is of the form
$$\frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)} \gtrless \tau. \tag{8.1.2}$$
This is the likelihood ratio test, and the quantity $P_{X^n}(x^n)/P_{\hat X^n}(x^n)$ is called the likelihood ratio. If a logarithm is applied to both sides of (8.1.2), the test remains equivalent.

Likelihood ratio versus divergence II:8-20

$$\begin{aligned}
\log \frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)}
&= \sum_{i=1}^n \log \frac{P_X(x_i)}{P_{\hat X}(x_i)} \\
&= \sum_{a \in \mathcal{X}} \#a(x^n) \log \frac{P_X(a)}{P_{\hat X}(a)} \\
&= \sum_{a \in \mathcal{X}} n\, P_{C_{x^n}}(a) \log \frac{P_X(a)}{P_{\hat X}(a)} \\
&= n \sum_{a \in \mathcal{X}} \left[ P_{C_{x^n}}(a) \log \frac{P_{C_{x^n}}(a)}{P_{\hat X}(a)} - P_{C_{x^n}}(a) \log \frac{P_{C_{x^n}}(a)}{P_X(a)} \right] \\
&= n \left[ D(P_{C_{x^n}} \| P_{\hat X}) - D(P_{C_{x^n}} \| P_X) \right].
\end{aligned}$$
Hence, (8.1.2) is equivalent to
$$D(P_{C_{x^n}} \| P_{\hat X}) - D(P_{C_{x^n}} \| P_X) \gtrless \frac{1}{n}\log\tau. \tag{8.1.3}$$
This equivalence means that for hypothesis testing, the acceptance region can be selected based on compositions instead of observations.

Likelihood ratio versus divergence II:8-21

In other words, the optimal decision function can be defined as
$$\phi(C) = \begin{cases} 0, & \text{if composition } C \text{ is classified as belonging to the null hypothesis according to (8.1.3);} \\ 1, & \text{otherwise.} \end{cases}$$
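The identity $\log [P_{X^n}(x^n)/P_{\hat X^n}(x^n)] = n[D(P_{C_{x^n}} \| P_{\hat X}) - D(P_{C_{x^n}} \| P_X)]$ is purely algebraic and can be confirmed on a concrete sequence (a sketch of mine over a ternary alphabet; `KL` and the distributions are my own choices):

```python
import math
import random

# Check log[P_{X^n}(x^n)/Q_{X^n}(x^n)] = n [D(P_C||Q) - D(P_C||P)]
# for one concrete i.i.d. sequence over a ternary alphabet.
P = [0.5, 0.3, 0.2]     # null hypothesis distribution
Q = [0.2, 0.3, 0.5]     # alternative hypothesis distribution
random.seed(0)
x = random.choices(range(3), weights=P, k=1000)
n = len(x)

counts = [x.count(a) for a in range(3)]
Pc = [c / n for c in counts]            # composition distribution of x^n

def KL(A, B):
    return sum(a * math.log(a / b) for a, b in zip(A, B) if a > 0)

log_lr = sum(math.log(P[a] / Q[a]) for a in x)     # direct log-likelihood ratio
via_types = n * (KL(Pc, Q) - KL(Pc, P))            # composition-based form
print(log_lr, via_types)
```

The two numbers agree to floating-point precision, which is why the acceptance region can be chosen composition by composition.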

Exponent of Bayesian cost II:8-22

Randomization is of no help to a Bayesian test: any randomized decision rule
$$\phi(x^n) = \begin{cases} 0, & \text{with probability } \eta; \\ 1, & \text{with probability } 1 - \eta, \end{cases}$$
satisfies
$$\pi_0\, \eta\, P_{X^n}(x^n) + \pi_1 (1-\eta)\, P_{\hat X^n}(x^n) \ge \min\{ \pi_0 P_{X^n}(x^n),\ \pi_1 P_{\hat X^n}(x^n) \}.$$
Now suppose the acceptance region for the null hypothesis is
$$A = \{ C : D(P_C \| P_{\hat X}) - D(P_C \| P_X) > \tau' \}.$$
Then by Sanov's theorem, the exponent of the type II error $\beta_n$ is
$$\min_{C \in A} D(P_C \| P_{\hat X}).$$
Similarly, the exponent of the type I error $\alpha_n$ is
$$\min_{C \in A^c} D(P_C \| P_X).$$

Exponent of Bayesian cost II:8-23

Lagrange multipliers: taking the derivative of
$$D(P'_X \| P_{\hat X}) + \lambda\big( D(P'_X \| P_{\hat X}) - D(P'_X \| P_X) - \tau' \big) + \nu\left( \sum_{x \in \mathcal{X}} P'_X(x) - 1 \right)$$
with respect to each $P'_X(x)$, we have
$$\log \frac{P'_X(x)}{P_{\hat X}(x)} + 1 + \lambda \log \frac{P_X(x)}{P_{\hat X}(x)} + \nu = 0.$$
Solving these equations, we obtain that the optimal $P'_X$ is of the form
$$P'_X(x) = P_\lambda(x) = \frac{P_X^\lambda(x)\, P_{\hat X}^{1-\lambda}(x)}{\sum_{a \in \mathcal{X}} P_X^\lambda(a)\, P_{\hat X}^{1-\lambda}(a)}.$$
The geometric interpretation of $P_\lambda$ is that it lies on the straight line between $P_X$ and $P_{\hat X}$ (in the sense of the divergence measure) over the probability space.

Exponent of Bayesian cost II:8-24

[Figure: the divergence view of hypothesis testing. The tilted distribution $P'_X$ lies on the boundary $D(P_C \| P_X) = D(P_C \| P_{\hat X}) - \tau'$ between $P_X$ and $P_{\hat X}$, at divergences $D(P'_X \| P_X)$ and $D(P'_X \| P_{\hat X})$ from the two hypothesis distributions.]

Exponent of Bayesian cost II:8-25

When $\lambda \to 0$, $P_\lambda \to P_{\hat X}$; when $\lambda \to 1$, $P_\lambda \to P_X$. Usually, $P_\lambda$ is named the tilted or twisted distribution.

The value of $\lambda$ depends on $\tau' = (1/n)\log\tau$. It is known from detection theory that the best $\tau$ for Bayes testing is $\pi_1/\pi_0$, which is fixed. Therefore,
$$\tau' = \lim_{n\to\infty} \frac{1}{n}\log\frac{\pi_1}{\pi_0} = 0,$$
which implies that the optimal exponent for the Bayes error is the minimum of $D(P_\lambda \| P_X)$ subject to $D(P_\lambda \| P_X) = D(P_\lambda \| P_{\hat X})$, namely the mid-point, in the divergence sense, of the line segment $(P_X, P_{\hat X})$ on the probability space (the balancing value of $\lambda$ equals $1/2$ only in symmetric cases). This quantity is called the Chernoff bound.
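The balancing point can be computed by bisection on $\lambda$ (a sketch of mine; `tilt`, `gap`, and the distributions are my own). It also checks the known closed form for the balanced divergence, $-\log \min_\lambda \sum_x P_X^\lambda(x) P_{\hat X}^{1-\lambda}(x)$, which coincides with $D(P_{\lambda^*} \| P_X)$ at the balancing $\lambda^*$.

```python
import math

# Tilted family P_lam proportional to P^lam * Q^(1-lam); bisect on lam so that
# D(P_lam || P) = D(P_lam || Q), the Bayes-error (Chernoff) exponent condition.
P = [0.7, 0.2, 0.1]
Q = [0.1, 0.4, 0.5]

def KL(A, B):
    return sum(a * math.log(a / b) for a, b in zip(A, B) if a > 0)

def tilt(lam):
    w = [p**lam * q**(1 - lam) for p, q in zip(P, Q)]
    s = sum(w)
    return [x / s for x in w], s

def gap(lam):
    T, _ = tilt(lam)
    return KL(T, P) - KL(T, Q)   # positive at lam=0, negative at lam=1

lo, hi = 0.0, 1.0
for _ in range(200):
    mid = (lo + hi) / 2
    if gap(mid) > 0:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2
T, s = tilt(lam)
chernoff = KL(T, P)
print("lambda* =", lam, " exponent =", chernoff, " -log s =", -math.log(s))
```

For this asymmetric pair the printed $\lambda^*$ differs visibly from $1/2$, while the two divergences and $-\log s$ all agree at the balance point.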

Large deviations theory II:8-26

Large deviations theory is basically concerned with techniques for computing the exponent of an exponentially decaying probability.

Tilted or twisted distribution II:8-27

Suppose the probability of a set, $P_{X^n}(A_n)$, decreases to zero exponentially fast, with exponent equal to $a > 0$. Over the probability space, let $\mathcal{P}$ denote the set of those distributions $P'_X$ for which $P'_{X^n}(A_n)$ exhibits zero exponent. Then, applying the same concept as in Sanov's theorem, we can expect that
$$a = \min_{P'_X \in \mathcal{P}} D(P'_X \| P_X).$$
Now suppose the minimizer of the above function lies on the surface $f(P'_X) = \tau$ for some constant $\tau$ and some differentiable function $f(\cdot)$. Then the minimizer should be of the form
$$(\forall\, a \in \mathcal{X})\quad P'_X(a) = \frac{P_X(a)\, e^{\lambda\, \partial f(P'_X)/\partial P'_X(a)}}{\sum_{a' \in \mathcal{X}} P_X(a')\, e^{\lambda\, \partial f(P'_X)/\partial P'_X(a')}}.$$
As a result, $P'_X$ is the distribution obtained from $P_X$ by exponential twisting via the partial derivatives of the function $f$. Note that $P'_X$ is usually written as $P_X^{(\lambda)}$, since it is generated by twisting $P_X$ with twist factor $\lambda$.

Conventional twisted distribution II:8-28

The conventional definition of the twisted distribution is based on the divergence function, i.e.,
$$f(P'_X) = D(P'_X \| P_X) - D(P'_X \| P_{\hat X}).$$
Since
$$\frac{\partial D(P'_X \| P_X)}{\partial P'_X(a)} = \log\frac{P'_X(a)}{P_X(a)} + 1,$$
the twisted distribution becomes
$$(\forall\, a \in \mathcal{X})\quad P'_X(a) = \frac{P_X(a)\, e^{\lambda \log\frac{P_{\hat X}(a)}{P_X(a)}}}{\sum_{a' \in \mathcal{X}} P_X(a')\, e^{\lambda \log\frac{P_{\hat X}(a')}{P_X(a')}}} = \frac{P_X^{1-\lambda}(a)\, P_{\hat X}^{\lambda}(a)}{\sum_{a' \in \mathcal{X}} P_X^{1-\lambda}(a')\, P_{\hat X}^{\lambda}(a')}.$$

Cramér's theorem II:8-29

Question: Consider a sequence of i.i.d. random variables $X^n$, and suppose that we are interested in the probability of the set
$$\left\{ \frac{X_1 + \cdots + X_n}{n} > \tau \right\}.$$
Observe that $(X_1 + \cdots + X_n)/n$ can be re-written as
$$\sum_{a \in \mathcal{X}} a\, P_C(a).$$
Therefore, the function $f$ becomes
$$f(P'_X) = \sum_{a \in \mathcal{X}} a\, P'_X(a),$$
and its partial derivative with respect to $P'_X(a)$ is $a$. The resultant twisted distribution is
$$(\forall\, a \in \mathcal{X})\quad P_X^{(\lambda)}(a) = \frac{P_X(a)\, e^{\lambda a}}{\sum_{a' \in \mathcal{X}} P_X(a')\, e^{\lambda a'}}.$$
So the exponent of $P_{X^n}\{(X_1 + \cdots + X_n)/n > \tau\}$ is
$$\min_{\{P'_X :\, \sum_a a P'_X(a) > \tau\}} D(P'_X \| P_X) = \min_{\{\lambda :\, \sum_a a P_X^{(\lambda)}(a) > \tau\}} D(P_X^{(\lambda)} \| P_X).$$

Cramér's theorem II:8-30

It should be pointed out that $\sum_{a' \in \mathcal{X}} P_X(a')\, e^{\lambda a'}$ is the moment generating function of $P_X$.

The conventional statement of Cramér's result does not use the divergence. Instead, it introduces the large deviation rate function, defined by
$$I_X(x) = \sup_{\theta \in \mathbb{R}} \left[ \theta x - \log M_X(\theta) \right], \tag{8.2.4}$$
where $M_X(\theta)$ is the moment generating function of $X$. In this statement, the exponent of the above probability is respectively lower- and upper-bounded by
$$\inf_{x \ge \tau} I_X(x) \quad\text{and}\quad \inf_{x > \tau} I_X(x).$$
An example of how to obtain the exponent bounds is illustrated in the next subsection.
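The rate function (8.2.4) is easy to evaluate numerically, since $\theta x - \log M_X(\theta)$ is concave in $\theta$. The sketch below (my own; the search bracket $[-30, 30]$ is an assumption that holds for this example) does this for a Bernoulli random variable and compares against the binary-divergence closed form, illustrating the general fact that $I_X$ agrees with the divergence picture of the previous slide.

```python
import math

# I_X(x) = sup_theta [theta*x - log M_X(theta)] for X ~ Bernoulli(p).
p = 0.5

def log_mgf(t):
    return math.log(1 - p + p * math.exp(t))

def I(x, lo=-30.0, hi=30.0):
    """Ternary search: theta*x - log M_X(theta) is concave in theta."""
    for _ in range(300):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if m1 * x - log_mgf(m1) < m2 * x - log_mgf(m2):
            lo = m1
        else:
            hi = m2
    t = (lo + hi) / 2
    return t * x - log_mgf(t)

x = 0.75
closed = x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))
print(I(x), closed)   # both equal D(Bern(0.75) || Bern(0.5))
```

The numeric supremum matches $D(\mathrm{Bern}(x) \| \mathrm{Bern}(p))$, and $I_X$ vanishes at the mean $x = p$, as a rate function must.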

Exponent and moment generating function II:8-31

A) Preliminaries: Observe that since $E[X] = \mu < \lambda$ and $E[|X - \mu|^2] < \infty$,
$$\Pr\left\{ \frac{X_1 + \cdots + X_n}{n} \ge \lambda \right\} \to 0 \quad\text{as } n \to \infty.$$
Hence, we can compute its rate of convergence (to zero).

B) Upper bound on the probability: for any $\theta > 0$,
$$\begin{aligned}
\Pr\left\{ \frac{X_1 + \cdots + X_n}{n} \ge \lambda \right\}
&= \Pr\{ \theta (X_1 + \cdots + X_n) \ge \theta n \lambda \} \\
&= \Pr\{ \exp(\theta (X_1 + \cdots + X_n)) \ge \exp(\theta n \lambda) \} \\
&\le \frac{E[\exp(\theta (X_1 + \cdots + X_n))]}{\exp(\theta n \lambda)} \\
&= \frac{E^n[\exp(\theta X)]}{\exp(\theta n \lambda)} = \left( \frac{M_X(\theta)}{\exp(\theta \lambda)} \right)^n.
\end{aligned}$$
Hence,
$$\liminf_{n\to\infty} -\frac{1}{n}\log \Pr\left\{ \frac{X_1 + \cdots + X_n}{n} > \lambda \right\} \ge \theta \lambda - \log M_X(\theta).$$

Exponent and moment generating function II:8-32

Since the above inequality holds for every $\theta > 0$, we have
$$\liminf_{n\to\infty} -\frac{1}{n}\log \Pr\left\{ \frac{X_1 + \cdots + X_n}{n} > \lambda \right\} \ge \max_{\theta > 0}\left[ \theta \lambda - \log M_X(\theta) \right] = \theta^* \lambda - \log M_X(\theta^*),$$
where $\theta^* > 0$ is the optimizer of the maximization. (The positivity of $\theta^*$ is easily verified from the concavity of the function $\theta \lambda - \log M_X(\theta)$ in $\theta$, whose derivative at $\theta = 0$ equals $\lambda - \mu$, which is strictly greater than 0.) Consequently,
$$\liminf_{n\to\infty} -\frac{1}{n}\log \Pr\left\{ \frac{X_1 + \cdots + X_n}{n} > \lambda \right\} \ge \theta^* \lambda - \log M_X(\theta^*) = \sup_{\theta \in \mathbb{R}}\left[ \theta \lambda - \log M_X(\theta) \right] = I_X(\lambda).$$

C) Lower bound on the probability: omitted.

Theories on large deviations II:8-33

In this section, we derive inequalities on the exponent of the probability $\Pr\{Z_n/n \in [a,b]\}$; these form a slight extension of the Gärtner-Ellis theorem.

Extension of Gärtner-Ellis upper bounds II:8-34

Definition 8.8 In this subsection, $\{Z_n\}_{n=1}^{\infty}$ will denote an infinite sequence of arbitrary random variables.

Definition 8.9 Define
$$\varphi_n(\theta) = \frac{1}{n}\log E[\exp\{\theta Z_n\}] \quad\text{and}\quad \bar\varphi(\theta) = \limsup_{n\to\infty} \varphi_n(\theta).$$
The sup-large-deviation rate function of an arbitrary random sequence $\{Z_n\}_{n=1}^{\infty}$ is defined as
$$\bar I(x) = \sup_{\{\theta \in \mathbb{R} :\, \bar\varphi(\theta) > -\infty\}} \left[ \theta x - \bar\varphi(\theta) \right]. \tag{8.3.5}$$
The range of the supremum in (8.3.5) is always non-empty since $\bar\varphi(0) = 0$, i.e., $\{\theta \in \mathbb{R} : \bar\varphi(\theta) > -\infty\} \ne \emptyset$. Hence, $\bar I(x)$ is always defined.

With the above definition, the first extension theorem of Gärtner-Ellis can be stated as follows.

Theorem 8.10 For $a, b \in \mathbb{R}$ and $a \le b$,
$$\limsup_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in [a,b] \right\} \le -\inf_{x \in [a,b]} \bar I(x).$$
The bound obtained in the above theorem is not in general tight.

Extension of Gärtner-Ellis upper bounds II:8-35

Example 8.11 Suppose that $\Pr\{Z_n = 0\} = 1 - e^{-2n}$ and $\Pr\{Z_n = -2n\} = e^{-2n}$. Then from Definition 8.9, we have
$$\varphi_n(\theta) = \frac{1}{n}\log E\left[ e^{\theta Z_n} \right] = \frac{1}{n}\log\left[ 1 - e^{-2n} + e^{-(\theta+1)2n} \right],$$
and
$$\bar\varphi(\theta) = \limsup_{n\to\infty} \varphi_n(\theta) = \begin{cases} 0, & \text{for } \theta \ge -1; \\ -2(\theta+1), & \text{for } \theta < -1. \end{cases}$$
Hence, $\{\theta \in \mathbb{R} : \bar\varphi(\theta) > -\infty\} = \mathbb{R}$ and
$$\bar I(x) = \sup_{\theta \in \mathbb{R}}[\theta x - \bar\varphi(\theta)] = \sup_{\theta \in \mathbb{R}}\left[ \theta x + 2(\theta+1)\,\mathbf{1}\{\theta < -1\} \right] = \begin{cases} -x, & \text{for } -2 \le x \le 0; \\ \infty, & \text{otherwise}, \end{cases}$$
where $\mathbf{1}\{\cdot\}$ represents the indicator function of a set.

Extension of Gärtner-Ellis upper bounds II:8-36

Consequently, by Theorem 8.10,
$$\limsup_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in [a,b] \right\} \le -\inf_{x \in [a,b]} \bar I(x) = \begin{cases} 0, & \text{for } 0 \in [a,b]; \\ b, & \text{for } b \in [-2, 0); \\ -\infty, & \text{otherwise}. \end{cases}$$
The exponent of $\Pr\{Z_n/n \in [a,b]\}$ in the above example is in fact given by
$$\lim_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in [a,b] \right\} = -\inf_{x \in [a,b]} I^*(x),$$
where
$$I^*(x) = \begin{cases} 2, & \text{for } x = -2; \\ 0, & \text{for } x = 0; \\ \infty, & \text{otherwise}. \end{cases} \tag{8.3.6}$$
Thus, the upper bound obtained in Theorem 8.10 is not tight.
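The two-point sequence of Example 8.11 is simple enough to compute $\varphi_n(\theta)$ directly and watch it converge to $\bar\varphi(\theta)$ (a sketch of mine; the log-space arrangement is only to avoid overflow for $\theta < -1$):

```python
import math

# Z_n = 0 w.p. 1 - e^{-2n}, Z_n = -2n w.p. e^{-2n}: compare phi_n with the
# limit phi_bar(theta) = 0 for theta >= -1 and -2(theta+1) for theta < -1.
def phi_n(theta, n):
    # (1/n) log E[e^{theta Z_n}] = (1/n) log(1 - e^{-2n} + e^{-2n(theta+1)}),
    # evaluated in log space so the second term cannot overflow.
    t1 = math.log1p(-math.exp(-2 * n))   # log(1 - e^{-2n})
    t2 = -2 * n * (theta + 1)            # log of e^{-2n(theta+1)}
    m = max(t1, t2)
    return (m + math.log(math.exp(t1 - m) + math.exp(t2 - m))) / n

def phi_bar(theta):
    return 0.0 if theta >= -1 else -2 * (theta + 1)

for theta in (-2.0, -1.0, 0.5):
    print(theta, phi_n(theta, 200), phi_bar(theta))
```

At $n = 200$ the finite-$n$ values already agree with $\bar\varphi$ to several digits; the slowest convergence is at the kink $\theta = -1$, where both exponential terms are of the same order.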

Extension of Gärtner-Ellis upper bounds II:8-37

Definition 8.12 Define
$$\varphi_n(\theta; h) = \frac{1}{n}\log E\left[ \exp\left\{ n\theta\, h\!\left( \frac{Z_n}{n} \right) \right\} \right] \quad\text{and}\quad \bar\varphi_h(\theta) = \limsup_{n\to\infty} \varphi_n(\theta; h),$$
where $h(\cdot)$ is a given real-valued continuous function. The twisted sup-large-deviation rate function of an arbitrary random sequence $\{Z_n\}_{n=1}^{\infty}$ with respect to a real-valued continuous function $h(\cdot)$ is defined as
$$\bar J_h(x) = \sup_{\{\theta \in \mathbb{R} :\, \bar\varphi_h(\theta) > -\infty\}} \left[ \theta\, h(x) - \bar\varphi_h(\theta) \right]. \tag{8.3.7}$$

Theorem 8.13 Suppose that $h(\cdot)$ is a real-valued continuous function. Then for $a, b \in \mathbb{R}$ and $a \le b$,
$$\limsup_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in [a,b] \right\} \le -\inf_{x \in [a,b]} \bar J_h(x).$$

Extension of Gärtner-Ellis upper bounds II:8-38

Example 8.14 Let us again investigate the $\{Z_n\}_{n=1}^{\infty}$ defined in Example 8.11. Take $h(x) = \frac{1}{2}(x+2)^2 - 1$. Then from Definition 8.12, we have
$$\varphi_n(\theta; h) = \frac{1}{n}\log E[\exp\{n\theta\, h(Z_n/n)\}] = \frac{1}{n}\log\left[ \exp\{n\theta\} - \exp\{n(\theta-2)\} + \exp\{-n(\theta+2)\} \right],$$
and
$$\bar\varphi_h(\theta) = \limsup_{n\to\infty} \varphi_n(\theta; h) = \begin{cases} -(\theta+2), & \text{for } \theta \le -1; \\ \theta, & \text{for } \theta > -1. \end{cases}$$
Hence, $\{\theta \in \mathbb{R} : \bar\varphi_h(\theta) > -\infty\} = \mathbb{R}$ and
$$\bar J_h(x) = \sup_{\theta \in \mathbb{R}}[\theta\, h(x) - \bar\varphi_h(\theta)] = \begin{cases} -\frac{1}{2}(x+2)^2 + 2, & \text{for } x \in [-4, 0]; \\ \infty, & \text{otherwise}. \end{cases}$$

Extension of Gärtner-Ellis upper bounds II:8-39

Consequently, by Theorem 8.13,
$$\limsup_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in [a,b] \right\} \le -\inf_{x \in [a,b]} \bar J_h(x) = \begin{cases} \max\left\{ \dfrac{(a+2)^2}{2},\ \dfrac{(b+2)^2}{2} \right\} - 2, & \text{for } -4 \le a < b \le 0; \\ -\infty, & \text{for } a > 0 \text{ or } b < -4; \\ 0, & \text{otherwise}. \end{cases} \tag{8.3.8}$$
For $b \in (-2, 0)$ and $a \in [-2 - \sqrt{2b+4},\, b)$, the upper bound attained in the previous example is strictly less than that given in Example 8.11, and hence an improvement is obtained. However, for $b \in (-2, 0)$ and $a < -2 - \sqrt{2b+4}$, the upper bound in (8.3.8) is actually looser. Accordingly, we combine the two upper bounds from Examples 8.11 and 8.14 to get
$$\limsup_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in [a,b] \right\} \le -\inf_{x \in [a,b]} \max\left\{ \bar J_h(x),\ \bar I(x) \right\} = \begin{cases} 0, & \text{for } 0 \in [a,b]; \\ \frac{1}{2}(b+2)^2 - 2, & \text{for } b \in [-2, 0); \\ -\infty, & \text{otherwise}. \end{cases}$$

Extension of Gärtner-Ellis upper bounds II:8-40

Theorem 8.15 For $a, b \in \mathbb{R}$ and $a \le b$,
$$\limsup_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in [a,b] \right\} \le -\inf_{x \in [a,b]} \bar J(x),$$
where $\bar J(x) = \sup_{h \in \mathcal{H}} \bar J_h(x)$ and $\mathcal{H}$ is the set of all real-valued continuous functions.

Example 8.16 Let us again study the $\{Z_n\}_{n=1}^{\infty}$ in Example 8.11 (also in Example 8.14). Suppose $c > 1$. Take
$$h_c(x) = c_1 (x + c_2)^2 - c, \quad\text{where } c_1 = \frac{c + \sqrt{c^2 - 1}}{2} \text{ and } c_2 = \frac{2\sqrt{c+1}}{\sqrt{c+1} + \sqrt{c-1}}.$$
Then from Definition 8.12, we have
$$\varphi_n(\theta; h_c) = \frac{1}{n}\log E\left[ \exp\left\{ n\theta\, h_c\!\left( \frac{Z_n}{n} \right) \right\} \right] = \frac{1}{n}\log\left[ \exp\{n\theta\} - \exp\{n(\theta-2)\} + \exp\{-n(\theta+2)\} \right],$$
and
$$\bar\varphi_{h_c}(\theta) = \limsup_{n\to\infty} \varphi_n(\theta; h_c) = \begin{cases} -(\theta+2), & \text{for } \theta \le -1; \\ \theta, & \text{for } \theta > -1. \end{cases}$$

Extension of Gärtner-Ellis upper bounds II:8-41

Hence, $\{\theta \in \mathbb{R} : \bar\varphi_{h_c}(\theta) > -\infty\} = \mathbb{R}$ and
$$\bar J_{h_c}(x) = \sup_{\theta \in \mathbb{R}}[\theta\, h_c(x) - \bar\varphi_{h_c}(\theta)] = \begin{cases} -c_1 (x + c_2)^2 + c + 1, & \text{for } x \in [-2c_2, 0]; \\ \infty, & \text{otherwise}. \end{cases}$$
From Theorem 8.15,
$$\bar J(x) = \sup_{h \in \mathcal{H}} \bar J_h(x) \ge \max\left\{ \liminf_{c\to\infty} \bar J_{h_c}(x),\ \bar I(x) \right\} = I^*(x),$$
where $I^*(x)$ is defined in (8.3.6). Consequently,
$$\limsup_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in [a,b] \right\} \le -\inf_{x \in [a,b]} \bar J(x) \le -\inf_{x \in [a,b]} I^*(x) = \begin{cases} 0, & \text{if } 0 \in [a,b]; \\ -2, & \text{if } -2 \in [a,b] \text{ and } 0 \notin [a,b]; \\ -\infty, & \text{otherwise}, \end{cases}$$
and a tight upper bound is finally obtained!

Extension of Gärtner-Ellis upper bounds II:8-42

Definition 8.17 Define $\underline\varphi_h(\theta) = \liminf_{n\to\infty} \varphi_n(\theta; h)$, where $\varphi_n(\theta; h)$ was defined in Definition 8.12. The twisted inf-large-deviation rate function of an arbitrary random sequence $\{Z_n\}_{n=1}^{\infty}$ with respect to a real-valued continuous function $h(\cdot)$ is defined as
$$\underline J_h(x) = \sup_{\{\theta \in \mathbb{R} :\, \underline\varphi_h(\theta) > -\infty\}} \left[ \theta\, h(x) - \underline\varphi_h(\theta) \right].$$

Theorem 8.18 For $a, b \in \mathbb{R}$ and $a \le b$,
$$\liminf_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in [a,b] \right\} \le -\inf_{x \in [a,b]} \underline J(x),$$
where $\underline J(x) = \sup_{h \in \mathcal{H}} \underline J_h(x)$ and $\mathcal{H}$ is the set of all real-valued continuous functions.

Extension of Gärtner-Ellis lower bounds II:8-43

We hope to know when
$$\limsup_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in (a,b) \right\} \ge -\inf_{x \in (a,b)} \bar J_h(x). \tag{8.3.9}$$

Definition 8.19 Define the sup-Gärtner-Ellis set with respect to a real-valued continuous function $h(\cdot)$ as
$$\bar D_h = \bigcup_{\{\theta \in \mathbb{R} :\, \bar\varphi_h(\theta) > -\infty\}} \bar D(\theta; h),$$
where
$$\bar D(\theta; h) = \left\{ x \in \mathbb{R} :\ \limsup_{t \downarrow 0} \frac{\bar\varphi_h(\theta+t) - \bar\varphi_h(\theta)}{t} \le h(x) \le \liminf_{t \downarrow 0} \frac{\bar\varphi_h(\theta) - \bar\varphi_h(\theta-t)}{t} \right\}.$$
Let us briefly remark on the sup-Gärtner-Ellis set defined above. It can be derived that the sup-Gärtner-Ellis set reduces to
$$\bar D_h = \bigcup_{\{\theta \in \mathbb{R} :\, \bar\varphi_h(\theta) > -\infty\}} \{ x \in \mathbb{R} : \bar\varphi_h'(\theta) = h(x) \}$$
if the derivative $\bar\varphi_h'(\theta)$ exists for all $\theta$.

Extension of Gärtner-Ellis lower bounds II:8-44

Theorem 8.20 Suppose that $h(\cdot)$ is a real-valued continuous function. Then if $(a,b) \subseteq \bar D_h$,
$$\limsup_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in (a,b) \right\} \ge -\inf_{x \in (a,b)} \bar J_h(x).$$

Example 8.21 Suppose $Z_n = X_1 + \cdots + X_n$, where $\{X_i\}_{i=1}^n$ are i.i.d. Gaussian random variables with mean 1 and variance 1 if $n$ is even, and with mean $-1$ and variance 1 if $n$ is odd. Then the exact large deviation rate formula $\bar I^*(x)$ that satisfies, for all $a < b$,
$$-\inf_{x \in [a,b]} \bar I^*(x) \ge \limsup_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in [a,b] \right\} \ge \limsup_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in (a,b) \right\} \ge -\inf_{x \in (a,b)} \bar I^*(x)$$
is
$$\bar I^*(x) = \frac{(|x|-1)^2}{2}. \tag{8.3.10}$$

Case A: $h(x) = x$. For this affine $h(\cdot)$, $\varphi_n(\theta) = \theta + \theta^2/2$ when $n$ is even, and $\varphi_n(\theta) = -\theta + \theta^2/2$

Extension of Gärtner-Ellis lower bounds II:8-45

when $n$ is odd. Hence, $\bar\varphi(\theta) = |\theta| + \theta^2/2$, and
$$\bar D_h = \left( \bigcup_{\theta > 0} \{ v \in \mathbb{R} : v = 1 + \theta \} \right) \cup \left( \bigcup_{\theta < 0} \{ v \in \mathbb{R} : v = -1 + \theta \} \right) = (1, \infty) \cup (-\infty, -1),$$
which excludes $\{ v \in \mathbb{R} : -1 \le v \le 1 \}$. Therefore, Theorem 8.20 cannot be applied to any $a$ and $b$ with $(a,b) \subseteq [-1, 1]$. By deriving
$$\bar I(x) = \sup_{\theta \in \mathbb{R}} \{ x\theta - \bar\varphi(\theta) \} = \begin{cases} \dfrac{(|x|-1)^2}{2}, & \text{for } |x| > 1; \\ 0, & \text{for } |x| \le 1, \end{cases}$$
we obtain, for any $a \in (-\infty, -1) \cup (1, \infty)$,
$$\lim_{\varepsilon \downarrow 0} \limsup_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in (a-\varepsilon, a+\varepsilon) \right\} \ge -\lim_{\varepsilon \downarrow 0} \inf_{x \in (a-\varepsilon, a+\varepsilon)} \bar I(x) = -\frac{(|a|-1)^2}{2},$$

Extension of Gärtner-Ellis lower bounds II:8-46

which can be shown to be tight by Theorem 8.13 (or directly by (8.3.10)). Note that the above inequality does not hold for any $a \in (-1, 1)$. To fill the gap, a different $h(\cdot)$ must be employed.

Case B: $h(x) = |x - a|$. For $n$ even,
$$\begin{aligned}
E\left[ e^{n\theta h(Z_n/n)} \right] &= E\left[ e^{\theta |Z_n - na|} \right] \\
&= \int_{-\infty}^{na} e^{-\theta(x - na)} \frac{1}{\sqrt{2\pi n}}\, e^{-(x-n)^2/(2n)}\, dx + \int_{na}^{\infty} e^{\theta(x - na)} \frac{1}{\sqrt{2\pi n}}\, e^{-(x-n)^2/(2n)}\, dx \\
&= e^{n\theta(\theta - 2 + 2a)/2} \int_{-\infty}^{na} \frac{1}{\sqrt{2\pi n}}\, e^{-[x - n(1-\theta)]^2/(2n)}\, dx + e^{n\theta(\theta + 2 - 2a)/2} \int_{na}^{\infty} \frac{1}{\sqrt{2\pi n}}\, e^{-[x - n(1+\theta)]^2/(2n)}\, dx \\
&= e^{n\theta(\theta - 2 + 2a)/2}\, \Phi\big( (\theta + a - 1)\sqrt{n} \big) + e^{n\theta(\theta + 2 - 2a)/2}\, \Phi\big( (\theta - a + 1)\sqrt{n} \big),
\end{aligned}$$
where $\Phi(\cdot)$ represents the unit Gaussian cdf.

Extension of Gärtner-Ellis lower bounds II:8-47

Similarly, for $n$ odd,
$$E\left[ e^{n\theta h(Z_n/n)} \right] = e^{n\theta(\theta + 2 + 2a)/2}\, \Phi\big( (\theta + a + 1)\sqrt{n} \big) + e^{n\theta(\theta - 2 - 2a)/2}\, \Phi\big( (\theta - a - 1)\sqrt{n} \big).$$
Observe that for any $b \in \mathbb{R}$,
$$\lim_{n\to\infty} \frac{1}{n}\log \Phi(b\sqrt{n}) = \begin{cases} 0, & \text{for } b \ge 0; \\ -\dfrac{b^2}{2}, & \text{for } b < 0. \end{cases}$$
Hence,
$$\bar\varphi_h(\theta) = \begin{cases} -\dfrac{(|a|-1)^2}{2}, & \text{for } \theta < |a| - 1; \\[1ex] \dfrac{\theta\,[\theta + 2(1 - |a|)]}{2}, & \text{for } |a| - 1 \le \theta < 0; \\[1ex] \dfrac{\theta\,[\theta + 2(1 + |a|)]}{2}, & \text{for } \theta \ge 0. \end{cases}$$

Extension of Gärtner-Ellis lower bounds II:8-48

Therefore,
$$\bar D_h = \left( \bigcup_{\theta > 0} \{ x \in \mathbb{R} : |x - a| = \theta + 1 + |a| \} \right) \cup \left( \bigcup_{\theta < 0} \{ x \in \mathbb{R} : |x - a| = \theta + 1 - |a| \} \right)$$
$$= (-\infty,\ a - 1 - |a|) \cup (a - 1 + |a|,\ a + 1 - |a|) \cup (a + 1 + |a|,\ \infty)$$
and
$$\bar J_h(x) = \begin{cases} \dfrac{(|x - a| - 1 + |a|)^2}{2}, & \text{for } a - 1 + |a| < x < a + 1 - |a|; \\[1ex] \dfrac{(|x - a| - 1 - |a|)^2}{2}, & \text{for } x > a + 1 + |a| \text{ or } x < a - 1 - |a|; \\[1ex] 0, & \text{otherwise}. \end{cases} \tag{8.3.11}$$

Extension of Gärtner-Ellis lower bounds II:8-49

We then apply Theorem 8.20 to obtain
$$\lim_{\varepsilon \downarrow 0} \limsup_{n\to\infty} \frac{1}{n}\log \Pr\left\{ \frac{Z_n}{n} \in (a-\varepsilon, a+\varepsilon) \right\} \ge -\lim_{\varepsilon \downarrow 0} \inf_{x \in (a-\varepsilon, a+\varepsilon)} \bar J_h(x) = -\lim_{\varepsilon \downarrow 0} \frac{(\varepsilon - 1 + |a|)^2}{2} = -\frac{(|a|-1)^2}{2}.$$
Note that the above lower bound is valid for any $a \in (-1, 1)$, and can again be shown to be tight by Theorem 8.13 (or directly by (8.3.10)). Finally, by combining the results of Cases A) and B), the true large deviation rate of $\{Z_n\}_{n \ge 1}$ is completely characterized.

Extension of Gärtner-Ellis lower bounds II:8-50

Definition 8.22 Define the inf-Gärtner-Ellis set with respect to a real-valued continuous function $h(\cdot)$ as

$$\underline{\mathcal{D}}_h=\bigcup_{\theta\in\{\theta\in\mathbb{R}\,:\,\underline\varphi_h(\theta)>-\infty\}}\underline{\mathcal{D}}(\theta;h),$$

where

$$\underline{\mathcal{D}}(\theta;h)=\left\{x\in\mathbb{R}:\ \limsup_{t\downarrow 0}\frac{\underline\varphi_h(\theta+t)-\underline\varphi_h(\theta)}{t}\ \ge\ h(x)\ \ge\ \liminf_{t\downarrow 0}\frac{\underline\varphi_h(\theta)-\underline\varphi_h(\theta-t)}{t}\right\}.$$

Theorem 8.23 Suppose that $h(\cdot)$ is a real-valued continuous function. Then if $(a,b)\subseteq\underline{\mathcal{D}}_h$,

$$\liminf_{n\to\infty}\frac{1}{n}\log\Pr\left\{\frac{Z_n}{n}\in(a,b)\right\}\ \ge\ -\inf_{x\in(a,b)}\underline J_h(x).$$

Properties II:8-51

Property 8.24 Let $\bar I(x)$ and $\underline I(x)$ be the sup- and inf- large deviation rate functions of an infinite sequence of arbitrary random variables $\{Z_n\}_{n=1}^{\infty}$, respectively. Denote $m_n=(1/n)E[Z_n]$. Let $\bar m=\limsup_{n\to\infty}m_n$ and $\underline m=\liminf_{n\to\infty}m_n$. Then

1. $\bar I(x)$ and $\underline I(x)$ are both convex.
2. $\bar I(x)$ is continuous over $\{x\in\mathbb{R}:\bar I(x)<\infty\}$. Likewise, $\underline I(x)$ is continuous over $\{x\in\mathbb{R}:\underline I(x)<\infty\}$.
3. $\bar I(x)$ attains its minimum value $0$ at every $x$ with $\underline m\le x\le\bar m$.
4. $\underline I(x)\ge 0$. But $\underline I(x)$ does not necessarily attain its minimum value at $x=\bar m$ or $x=\underline m$.
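For the running example, $m_n=(1/n)E[Z_n]$ alternates between $+1$ and $-1$, so $\bar m=1$ and $\underline m=-1$. The sketch below (illustrative, not from the slides) checks midpoint convexity of the example's sup-rate function $\bar I(x)=(|x|-1)^2/2$ for $|x|>1$ (and $0$ otherwise), and that $\bar I$ vanishes exactly on $[\underline m,\bar m]=[-1,1]$, as Property 8.24 asserts.

```python
def I_bar(x):
    # sup-rate function of the running example
    return (abs(x) - 1) ** 2 / 2 if abs(x) > 1 else 0.0

xs = [i / 100.0 for i in range(-500, 501)]  # grid on [-5, 5]

# Property 8.24(1): midpoint convexity on the grid.
for i in range(len(xs) - 2):
    x, z = xs[i], xs[i + 2]
    assert I_bar((x + z) / 2) <= (I_bar(x) + I_bar(z)) / 2 + 1e-12

# Property 8.24(3): minimum value 0 attained exactly on [-1, 1].
assert all(I_bar(x) == 0.0 for x in xs if -1 <= x <= 1)
assert all(I_bar(x) > 0.0 for x in xs if abs(x) > 1)
print("I_bar is convex on the grid and vanishes exactly on [-1, 1]")
```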

Properties II:8-52

Property 8.25 Suppose that $h(\cdot)$ is a real-valued continuous function. Let $\bar J_h(x)$ and $\underline J_h(x)$ be the corresponding twisted sup- and inf- large deviation rate functions, respectively. Denote $m_n(h)=E[h(Z_n/n)]$. Let

$$\bar m_h=\limsup_{n\to\infty}m_n(h)\quad\text{and}\quad \underline m_h=\liminf_{n\to\infty}m_n(h).$$

Then

1. $\bar J_h(x)\ge 0$, with equality holding if $\underline m_h\le h(x)\le\bar m_h$.
2. $\underline J_h(x)\ge 0$, but $\underline J_h(x)$ does not necessarily attain its minimum value at $x=\bar m_h$ or $x=\underline m_h$.

Probabilistic subexponential behavior II:8-53

Two sequences can share the same exponent and still exhibit different subexponential behavior: $a_n=(1/n)\exp\{-2n\}$ and $b_n=(1/\sqrt{n})\exp\{-2n\}$ have the same exponent, but contain different subexponential terms.
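A quick sketch of this distinction (illustrative): both sequences satisfy $(1/n)\log(\cdot)\to -2$, while the ratio $b_n/a_n=\sqrt{n}$ grows without bound.

```python
import math

for n in (1000, 10000):
    log_a = -math.log(n) - 2 * n        # log a_n for a_n = (1/n) e^{-2n}
    log_b = -0.5 * math.log(n) - 2 * n  # log b_n for b_n = (1/sqrt(n)) e^{-2n}
    assert abs(log_a / n + 2) < 0.01    # both normalized exponents tend to -2
    assert abs(log_b / n + 2) < 0.01
    # but the subexponential factors differ: log(b_n / a_n) = (1/2) log n
    assert abs((log_b - log_a) - 0.5 * math.log(n)) < 1e-9
print("same exponent -2, different subexponential factors")
```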

Berry-Esseen theorem for compound i.i.d. sequence II:8-54

The Berry-Esseen theorem states that the distribution of the sum of independent zero-mean random variables $\{X_i\}_{i=1}^{n}$, normalized by the standard deviation of the sum, differs from the Gaussian distribution by at most $C r_n/s_n^3$, where $s_n^2$ and $r_n$ are respectively the sums of the marginal variances and the marginal absolute third moments, and $C$ is an absolute constant. Specifically, for every $a\in\mathbb{R}$,

$$\left|\Pr\left\{\frac{1}{s_n}(X_1+\cdots+X_n)\le a\right\}-\Phi(a)\right|\ \le\ C\,\frac{r_n}{s_n^3},\qquad(8.4.12)$$

where $\Phi(\cdot)$ represents the unit Gaussian cdf.

The striking feature of this theorem is that the upper bound depends only on the variance and the absolute third moment, and hence can provide a good asymptotic estimate based on only the first three moments. The absolute constant $C$ is commonly taken to be $6$. When $\{X_i\}_{i=1}^{n}$ are identically distributed, in addition to being independent, the absolute constant can be reduced to $3$, and has been reported to be improved down to $2.05$.

Definition: compound i.i.d. sequence. The samples that concern us in this section actually consist of two i.i.d. subsequences (the overall sequence is therefore named a compound i.i.d. sequence).
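As an illustrative check of (8.4.12) (a sketch, not from the slides; the Bernoulli parameters are arbitrary): for an i.i.d. Bernoulli$(0.3)$ sum the exact Kolmogorov distance to the Gaussian can be computed and compared against the bound with the i.i.d. constant $C=3$ quoted above.

```python
import math

def Phi(z):
    return 0.5 * math.erfc(-z / math.sqrt(2))

n, p = 50, 0.3
q = 1 - p
mean, s = n * p, math.sqrt(n * p * q)
rho = p * q * (p * p + q * q)       # E|X1 - p|^3 for Bernoulli(p)
sigma3 = (p * q) ** 1.5             # sigma^3

# Exact sup_y |H_n(y) - Phi(y)|: H_n is a step function with jumps at z_k,
# so the supremum is attained just below or just above a jump.
cdf, max_dev = 0.0, 0.0
for k in range(n + 1):
    pmf = math.comb(n, k) * p ** k * q ** (n - k)
    z = (k - mean) / s
    max_dev = max(max_dev, abs(cdf - Phi(z)))   # just below the jump
    cdf += pmf
    max_dev = max(max_dev, abs(cdf - Phi(z)))   # just above the jump

bound = 3 * rho / (sigma3 * math.sqrt(n))       # C = 3 (i.i.d. case)
assert max_dev <= bound
print(f"sup deviation = {max_dev:.4f} <= Berry-Esseen bound {bound:.4f}")
```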

Berry-Esseen theorem for compound i.i.d. sequence II:8-55

Lemma 8.26 (smoothing lemma) Fix the bandlimited filtering function

$$v_T(x)=\frac{1-\cos(Tx)}{\pi Tx^2}=\frac{2\sin^2(Tx/2)}{\pi Tx^2}=\frac{T}{2\pi}\,\mathrm{sinc}^2\!\left(\frac{Tx}{2\pi}\right)=\mathrm{Four}^{-1}\!\left[\Lambda\!\left(\frac{f}{T/(2\pi)}\right)\right].$$

For any cumulative distribution function $H(\cdot)$ on the real line $\mathbb{R}$,

$$\sup_{x\in\mathbb{R}}|\Delta_T(x)|\ \ge\ \frac{\eta}{2}-\frac{6}{T\pi\sqrt{2\pi}}\,h\!\left(\frac{T\sqrt{2\pi}}{2}\,\eta\right),$$

where

$$\Delta_T(t)=\int_{-\infty}^{\infty}[H(t-x)-\Phi(t-x)]\,v_T(x)\,dx,\qquad \eta=\sup_{x\in\mathbb{R}}|H(x)-\Phi(x)|,$$

and

$$h(u)=\begin{cases}\displaystyle\int_u^{\infty}\frac{1-\cos(x)}{x^2}\,dx=\frac{\pi}{2}-\int_0^u\frac{\sin(x)}{x}\,dx+\frac{1-\cos(u)}{u}, & \text{if } u>0;\\[1ex] 0, & \text{otherwise.}\end{cases}$$
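The two representations of $h(\cdot)$ — the tail integral and the sine-integral form — can be compared numerically (illustrative sketch): integrate $(1-\cos x)/x^2$ from $u$ out to a large cutoff $U$, approximate the remaining tail by $1/U$ (accurate to $O(1/U^2)$), and compare with $\pi/2-\int_0^u(\sin x)/x\,dx+(1-\cos u)/u$.

```python
import math

def trapezoid(f, lo, hi, steps):
    dx = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * dx)
    return total * dx

def h_direct(u, U=500.0):
    # h(u) = int_u^inf (1 - cos x)/x^2 dx; the tail beyond U is 1/U + O(1/U^2)
    return trapezoid(lambda x: (1 - math.cos(x)) / x ** 2, u, U, 100000) + 1.0 / U

def h_closed(u):
    # pi/2 - Si(u) + (1 - cos u)/u, with Si computed by quadrature
    si = trapezoid(lambda x: math.sin(x) / x if x else 1.0, 0.0, u, 2000)
    return math.pi / 2 - si + (1 - math.cos(u)) / u

for u in (0.5, 1.0, 3.0):
    assert abs(h_direct(u) - h_closed(u)) < 1e-3, u
print("closed form for h(u) agrees with direct integration")
```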

Berry-Esseen theorem for compound i.i.d. sequence II:8-56

Lemma 8.27 For any cumulative distribution function $H(\cdot)$ with characteristic function $\varphi_H(\zeta)$,

$$\eta\ \le\ \frac{1}{\pi}\int_{-T}^{T}\left|\frac{\varphi_H(\zeta)-e^{-(1/2)\zeta^2}}{\zeta}\right|d\zeta+\frac{12}{T\pi\sqrt{2\pi}}\,h\!\left(\frac{T\sqrt{2\pi}}{2}\,\eta\right),$$

where $\eta$ and $h(\cdot)$ are defined in Lemma 8.26.

Theorem 8.28 (BE theorem for compound i.i.d. sequences) Let $Y_n=\sum_{i=1}^{n}X_i$ be the sum of independent random variables, among which $\{X_i\}_{i=1}^{d}$ are identically Gaussian distributed, and $\{X_i\}_{i=d+1}^{n}$ are identically distributed but not necessarily Gaussian.

Denote the mean-variance pairs of $X_1$ and $X_{d+1}$ by $(\mu,\sigma^2)$ and $(\hat\mu,\hat\sigma^2)$, respectively. Define

$$\rho=E\bigl[|X_1-\mu|^3\bigr],\qquad \hat\rho=E\bigl[|X_{d+1}-\hat\mu|^3\bigr],$$

and

$$s_n^2=\mathrm{Var}[Y_n]=\sigma^2 d+\hat\sigma^2(n-d).$$

Also denote the cdf of $(Y_n-E[Y_n])/s_n$ by $H_n(\cdot)$.

Berry-Esseen theorem for compound i.i.d. sequence II:8-57

Then for all $y\in\mathbb{R}$,

$$|H_n(y)-\Phi(y)|\ \le\ C_{n,d}\,\frac{2(n-d-1)}{\sqrt{\pi}\bigl(2(n-d)-3\sqrt{2}\bigr)}\cdot\frac{\hat\rho}{\hat\sigma^2\,s_n},$$

where $C_{n,d}$ is the unique positive number satisfying

$$\frac{\pi}{6}\,C_{n,d}-h(C_{n,d})=\frac{\pi\bigl(2(n-d)-3\sqrt{2}\bigr)}{12(n-d-1)}\left[2(3-\sqrt{2})+\frac{9\sqrt{3/2}}{2(11-6\sqrt{2})\,(n-d)}\right],$$

provided that $n-d\ge 3$.

Berry-Esseen theorem for compound i.i.d. sequence II:8-58

[Figure: plot of the function $(\pi/6)u-h(u)$ for $u\in[0,5]$; vertical axis from $-1$ to $2.5$.]

Function $(\pi/6)u-h(u)$.

Berry-Esseen theorem for compound i.i.d. sequence II:8-59

By letting $d=0$, the Berry-Esseen inequality for i.i.d. sequences can also be readily obtained from the previous theorem.

Corollary 8.29 (Berry-Esseen theorem for i.i.d. sequence) Let $Y_n=\sum_{i=1}^{n}X_i$ be the sum of independent random variables with common marginal distribution. Denote the marginal mean and variance by $(\hat\mu,\hat\sigma^2)$. Define $\hat\rho=E\bigl[|X_1-\hat\mu|^3\bigr]$. Also denote the cdf of $(Y_n-n\hat\mu)/(\sqrt{n}\,\hat\sigma)$ by $H_n(\cdot)$. Then for all $y\in\mathbb{R}$,

$$|H_n(y)-\Phi(y)|\ \le\ C_n\,\frac{2(n-1)}{\sqrt{\pi}(2n-3\sqrt{2})}\cdot\frac{\hat\rho}{\hat\sigma^3\sqrt{n}},$$

where $C_n$ is the unique positive solution of

$$\frac{\pi}{6}\,u-h(u)=\frac{\pi(2n-3\sqrt{2})}{12(n-1)}\left[2(3-\sqrt{2})+\frac{9\sqrt{3/2}}{2(11-6\sqrt{2})\,n}\right],$$

provided that $n\ge 3$.

Berry-Esseen theorem for compound i.i.d. sequence II:8-60

Let us briefly remark on the previous corollary.

We observe from numerical results that the quantity

$$C_n\,\frac{2(n-1)}{\sqrt{\pi}(2n-3\sqrt{2})}$$

is decreasing in $n$, and ranges from $3.628$ down to $1.627$ (cf. the figure on slide II:8-62).

We can upper-bound $C_n$ by the unique positive solution $D_n$ of

$$\frac{\pi}{6}\,u-h(u)=\frac{\sqrt{2}\,\pi}{6}\left[2(3-\sqrt{2})+\frac{9\sqrt{3/2}}{2(11-6\sqrt{2})\,n}\right],$$

which is strictly decreasing in $n$. Hence,

$$C_n\,\frac{2(n-1)}{\sqrt{\pi}(2n-3\sqrt{2})}\ \le\ E_n:=D_n\,\frac{2(n-1)}{\sqrt{\pi}(2n-3\sqrt{2})},$$

and the right-hand side of the above inequality is strictly decreasing in $n$ (since both $D_n$ and $(n-1)/(2n-3\sqrt{2})$ are decreasing), and ranges from $E_3=4.1911$, ..., $E_9=2.0363$, ..., $E_{100}=1.6833$ down to $E_\infty=1.6266$.

If strict monotonicity is preferred, one can use $D_n$ instead of $C_n$ in the Berry-Esseen inequality. Note that both $C_n$ and $D_n$ converge to $2.8831\ldots$ as $n$ goes to infinity.

Berry-Esseen theorem for compound i.i.d. sequence II:8-61

Numerical results show that the quantity $C_n\,\frac{2(n-1)}{\sqrt{\pi}(2n-3\sqrt{2})}$ lies below $2$ when $n\ge 9$, and is smaller than $1.68$ when $n\ge 100$. In other words, we can upper-bound this quantity by $1.68$ for $n\ge 100$, and thereby establish a better estimate of the original Berry-Esseen constant.

Berry-Esseen theorem for compound i.i.d. sequence II:8-62

[Figure: $C_n\,\frac{2(n-1)}{\sqrt{\pi}(2n-3\sqrt{2})}$ plotted against $n$ (from $3$ to $200$), decreasing toward the marked level $1.68$.]

The Berry-Esseen constant as a function of the sample size $n$. The sample size $n$ is plotted in log-scale.

Generalized Neyman-Pearson Hypothesis Testing II:8-63

The general expression for the Neyman-Pearson type-II error exponent subject to a constant bound on the type-I error has been proved for arbitrary observations. In this section, we state the results in terms of the $\varepsilon$-inf/sup-divergence rates.

Theorem 8.30 (Neyman-Pearson type-II error exponent for a fixed test level) Consider a sequence of random observations which is assumed to have a probability distribution governed by either $P_X$ (null hypothesis) or $P_{\hat X}$ (alternative hypothesis). Then the type-II error exponent satisfies

$$\lim_{\delta\uparrow\varepsilon}\bar D_\delta(X\|\hat X)\ \le\ \limsup_{n\to\infty}-\frac{1}{n}\log\beta_n^*(\varepsilon)\ \le\ \bar D_\varepsilon(X\|\hat X)$$

and

$$\lim_{\delta\uparrow\varepsilon}\underline D_\delta(X\|\hat X)\ \le\ \liminf_{n\to\infty}-\frac{1}{n}\log\beta_n^*(\varepsilon)\ \le\ \underline D_\varepsilon(X\|\hat X),$$

where $\beta_n^*(\varepsilon)$ represents the minimum type-II error probability subject to a fixed type-I error bound $\varepsilon\in[0,1)$.

The general formula for the Neyman-Pearson type-II error exponent subject to an exponential test level has also been proved in terms of the $\varepsilon$-inf/sup-divergence rates.
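For i.i.d. observations the $\varepsilon$-inf/sup-divergence rates collapse to the single-letter divergence, recovering Stein's lemma. The sketch below (illustrative; the Bernoulli parameters and test level are arbitrary) builds the optimal Neyman-Pearson test for Bernoulli$(0.2)$ versus Bernoulli$(0.6)$ and checks that $-(1/n)\log\beta_n^*(\varepsilon)$ is close to $D(P_X\|P_{\hat X})$.

```python
import math

p, qq, eps, n = 0.2, 0.6, 0.1, 500

def log_pmf(k, r):
    # log of the Binomial(n, r) pmf at k, via lgamma to avoid overflow
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(r) + (n - k) * math.log(1 - r))

# The likelihood ratio P/Q is decreasing in k = #ones, so the optimal
# Neyman-Pearson acceptance region for the null adds k = 0, 1, 2, ...
# until its P-mass reaches 1 - eps; beta* is the Q-mass of that region.
p_mass, beta = 0.0, 0.0
for k in range(n + 1):
    pk, qk = math.exp(log_pmf(k, p)), math.exp(log_pmf(k, qq))
    if p_mass + pk < 1 - eps:
        p_mass += pk
        beta += qk
    else:
        beta += qk * (1 - eps - p_mass) / pk  # randomize on the boundary class
        break

exponent = -math.log(beta) / n
D = p * math.log(p / qq) + (1 - p) * math.log((1 - p) / (1 - qq))  # nats
assert abs(exponent - D) < 0.08, (exponent, D)
print(f"-(1/n) log beta = {exponent:.3f}, divergence D = {D:.3f}")
```

The residual gap between the two numbers is the second-order (dispersion) term, which vanishes as the blocklength grows.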

Generalized Neyman-Pearson Hypothesis Testing II:8-64

Theorem 8.31 (Neyman-Pearson type-II error exponent for an exponential test level) Fix $s\in(0,1)$ and $\varepsilon\in[0,1)$. It is possible to choose decision regions for a binary hypothesis testing problem with arbitrary datawords of blocklength $n$ (which are governed by either the null hypothesis distribution $P_X$ or the alternative hypothesis distribution $P_{\hat X}$) such that

$$\liminf_{n\to\infty}-\frac{1}{n}\log\beta_n\ \ge\ \underline D_\varepsilon(\hat X^{(s)}\|\hat X)\quad\text{and}\quad \limsup_{n\to\infty}-\frac{1}{n}\log\alpha_n\ \ge\ \bar D_{(1-\varepsilon)}(\hat X^{(s)}\|X),\qquad(8.5.13)$$

or

$$\liminf_{n\to\infty}-\frac{1}{n}\log\beta_n\ \ge\ \bar D_\varepsilon(\hat X^{(s)}\|\hat X)\quad\text{and}\quad \limsup_{n\to\infty}-\frac{1}{n}\log\alpha_n\ \ge\ \underline D_{(1-\varepsilon)}(\hat X^{(s)}\|X),\qquad(8.5.14)$$

where $\hat X^{(s)}$ exhibits the tilted distributions $\{P^{(s)}_{\hat X^n}\}_{n=1}^{\infty}$ defined by

$$dP^{(s)}_{\hat X^n}(x^n)=\frac{1}{\Omega_n(s)}\exp\left\{s\log\frac{dP_{X^n}}{dP_{\hat X^n}}(x^n)\right\}dP_{\hat X^n}(x^n),$$

and

$$\Omega_n(s)=\int_{\mathcal{X}^n}\exp\left\{s\log\frac{dP_{X^n}}{dP_{\hat X^n}}(x^n)\right\}dP_{\hat X^n}(x^n).$$

Here, $\alpha_n$ and $\beta_n$ are the type-I and type-II error probabilities, respectively.
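For i.i.d. hypotheses the tilted distribution factorizes, so a single-letter computation suffices: with hypothetical Bernoulli marginals $P$ (null) and $Q$ (alternative), the tilted pmf is $Q^{(s)}(x)=P(x)^s Q(x)^{1-s}/\Omega_1(s)$ with $\Omega_1(s)=\sum_x P(x)^s Q(x)^{1-s}$. A minimal sketch (illustrative, not from the slides):

```python
# Tilting between two Bernoulli marginals: P = Bern(0.2) (null),
# Q = Bern(0.6) (alternative). The tilted pmf is
# Q^(s)(x) = Q(x) * (P(x)/Q(x))^s / Omega(s) = P(x)^s Q(x)^(1-s) / Omega(s).
P = {0: 0.8, 1: 0.2}
Q = {0: 0.4, 1: 0.6}

def tilt(s):
    omega = sum(P[x] ** s * Q[x] ** (1 - s) for x in (0, 1))
    return {x: P[x] ** s * Q[x] ** (1 - s) / omega for x in (0, 1)}, omega

for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    dist, omega = tilt(s)
    assert abs(sum(dist.values()) - 1.0) < 1e-12   # proper distribution
    assert 0 < omega <= 1.0 + 1e-12                # Omega(0) = Omega(1) = 1

# The parameter s sweeps the tilted family from Q (s = 0) to P (s = 1):
d0, _ = tilt(0.0)
d1, _ = tilt(1.0)
assert abs(d0[1] - Q[1]) < 1e-12 and abs(d1[1] - P[1]) < 1e-12
print("tilted family interpolates between the two hypotheses")
```

Varying $s$ trades the type-I exponent against the type-II exponent, which is exactly the role it plays in Theorem 8.31.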