Correlation Detection and an Operational Interpretation of the Rényi Mutual Information

Masahito Hayashi (1) and Marco Tomamichel (2)
(1) Graduate School of Mathematics, Nagoya University, and Centre for Quantum Technologies, National University of Singapore
(2) School of Physics, The University of Sydney

ISIT 2015, Hong Kong (arXiv:1408.6894)
Outline and Motivation

Rényi entropy and divergence (Rényi '61) have found various applications in information theory: e.g., error exponents for hypothesis testing and channel coding, cryptography, the Honey-Do problem, etc.

Conditional Rényi entropy and Rényi mutual information are less well understood. The mathematical properties of different proposed definitions have recently been investigated; see, e.g., Fehr & Berens (TIT '14) or Verdú (ITA '15), as well as many works in the quantum setting.

We want to find an operational interpretation of these measures.
Mutual Information

Two discrete random variables $(X, Y) \sim P_{XY}$. Many expressions for the mutual information are available:

  $I(X:Y) = H(X) + H(Y) - H(XY)$   (1)
  $= H(X) - H(X|Y)$   (2)
  $= D(P_{XY} \| P_X \times P_Y)$   (3)
  $= \min_{Q_Y} D(P_{XY} \| P_X \times Q_Y)$   (4)
  $= \min_{Q_X, Q_Y} D(P_{XY} \| Q_X \times Q_Y)$.   (5)

Which one should we generalize?
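A quick numerical check (not from the talk) that these expressions coincide, on a hypothetical joint pmf; at the Shannon point the minima in (4) and (5) are attained by the true marginals, so (3) already gives their value:

```python
import numpy as np

# Hypothetical joint pmf (rows: x, columns: y); any strictly positive pmf works.
P = np.array([[0.30, 0.10],
              [0.05, 0.55]])
Px, Py = P.sum(axis=1), P.sum(axis=0)

def H(p):                        # Shannon entropy in nats
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def D(p, q):                     # KL divergence; assumes supp(p) in supp(q)
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

i1 = H(Px) + H(Py) - H(P.ravel())              # (1)
i2 = H(Px) - (H(P.ravel()) - H(Py))            # (2), using H(X|Y) = H(XY) - H(Y)
i3 = D(P.ravel(), np.outer(Px, Py).ravel())    # (3)
print(i1, i2, i3)  # all equal; (4) and (5) are attained at the true marginals
```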
Rényi Mutual Information

Two discrete random variables $(X, Y) \sim P_{XY}$. Candidate Rényi generalizations:

1. $I_\alpha(X:Y) = H_\alpha(X) + H_\alpha(Y) - H_\alpha(XY)$   (1)
2. $I_\alpha(X:Y) = H_\alpha(X) - H_\alpha(X|Y)$   (2)
3. $I_\alpha(X:Y) = D_\alpha(P_{XY} \| P_X \times P_Y)$   (3)
4. $I_\alpha(X:Y) = \min_{Q_Y} D_\alpha(P_{XY} \| P_X \times Q_Y)$   (4)
5. $I_\alpha(X:Y) = \min_{Q_X, Q_Y} D_\alpha(P_{XY} \| Q_X \times Q_Y)$   (5)

We want the mutual information to be non-negative!
We want it to be non-increasing under local processing!

Definition (4) meets these requirements; this is Sibson's proposal.
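The choice matters: candidate (1), for instance, can already fail non-negativity. A minimal numerical check (the pmf below was found by a quick random search and is not from the talk); at $\alpha = 2$, candidate (1) equals $\log\bigl(C(XY)/(C(X)\,C(Y))\bigr)$ for collision probabilities $C$:

```python
import numpy as np

def renyi_H(p, a):
    """Order-a Rényi entropy of a pmf (a != 1), in nats."""
    return np.log(np.sum(np.asarray(p, float) ** a)) / (1.0 - a)

# Hypothetical joint pmf where candidate (1) goes negative at a = 2.
P = np.array([[0.450, 0.275],
              [0.275, 0.000]])
a = 2.0
val = renyi_H(P.sum(1), a) + renyi_H(P.sum(0), a) - renyi_H(P.ravel(), a)
print(val)  # about -0.022 < 0: candidate (1) is not non-negative
```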
Rényi Entropy and Divergence

For two pmfs $P_X$, $Q_X$, the Rényi divergence is defined as

$D_\alpha(P_X \| Q_X) = \frac{1}{\alpha - 1} \log \sum_x P_X(x)^\alpha\, Q_X(x)^{1-\alpha}$

for any $\alpha \in (0,1) \cup (1,\infty)$, and as a limit for $\alpha \in \{0, 1, \infty\}$.

Monotonicity: for $\alpha \le \beta$, we have $D_\alpha(P_X \| Q_X) \le D_\beta(P_X \| Q_X)$.

Kullback-Leibler divergence: $\lim_{\alpha \to 1} D_\alpha(P_X \| Q_X) = D(P_X \| Q_X) = \sum_x P_X(x) \log \frac{P_X(x)}{Q_X(x)}$.

Data-processing inequality (DPI): for any channel $W$, we have $D_\alpha(P_X \| Q_X) \ge D_\alpha(P_X W \| Q_X W)$.
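A small sketch of these three properties; the pmfs and channel below are illustrative choices, not from the talk:

```python
import numpy as np

def renyi_D(p, q, a):
    """Rényi divergence D_a(p||q) in nats, a in (0,1) or (1,inf);
    assumes supp(p) is contained in supp(q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return np.log(np.sum(p[m] ** a * q[m] ** (1 - a))) / (a - 1)

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

# Monotonicity in the order a, and the Kullback-Leibler limit at a -> 1.
print([round(renyi_D(p, q, a), 4) for a in [0.3, 0.7, 1.0001, 1.5, 3.0]])
print(round(np.sum(p * np.log(p / q)), 4))   # KL, matches the a ~ 1 value above

# Data-processing: pushing both pmfs through a channel W cannot increase D_a.
W = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])   # rows: inputs
print(renyi_D(p @ W, q @ W, 2.0) <= renyi_D(p, q, 2.0))  # True
```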
Rényi Mutual Information

Recall: $I_\alpha(X:Y) = \min_{Q_Y} D_\alpha(P_{XY} \| P_X \times Q_Y)$.

It inherits monotonicity and the DPI from the divergence, and we have $\lim_{\alpha \to 1} I_\alpha(X:Y) = I(X:Y)$.

Sibson's identity (Sibson '69): the minimizer satisfies

$Q_Y^*(y)^\alpha \propto \sum_x P_X(x)\, P_{Y|X}(y|x)^\alpha$,

$I_\alpha(X:Y) = \frac{\alpha}{\alpha - 1} \log \sum_y \Bigl( \sum_x P_X(x)\, P_{Y|X}(y|x)^\alpha \Bigr)^{1/\alpha}$.

Additivity: for $(X_1, X_2, Y_1, Y_2) \sim P_{X_1 Y_1} \times P_{X_2 Y_2}$ independent:

$I_\alpha(X_1 X_2 : Y_1 Y_2) = I_\alpha(X_1 : Y_1) + I_\alpha(X_2 : Y_2)$.
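A sketch verifying the closed form numerically (random pmf, illustrative only): no candidate $Q_Y$ beats Sibson's value, and additivity holds on product distributions:

```python
import numpy as np

def sibson_I(P, a):
    """Sibson's closed form for I_a(X:Y), joint pmf P[x, y]."""
    Px = P.sum(axis=1)
    # inner[y] = sum_x P_X(x) P_{Y|X}(y|x)^a == sum_x P[x,y]^a / P_X(x)^(a-1)
    inner = np.sum(P ** a / Px[:, None] ** (a - 1), axis=0)
    return (a / (a - 1)) * np.log(np.sum(inner ** (1 / a)))

def renyi_D(p, q, a):
    m = p > 0
    return np.log(np.sum(p[m] ** a * q[m] ** (1 - a))) / (a - 1)

rng = np.random.default_rng(1)
P = rng.random((3, 4)); P /= P.sum()
a = 1.7
I_cl = sibson_I(P, a)

# No Q_Y beats the closed form (here: a few random candidates) ...
Px = P.sum(axis=1)
for _ in range(5):
    Q = rng.random(4); Q /= Q.sum()
    assert renyi_D(P.ravel(), np.outer(Px, Q).ravel(), a) >= I_cl - 1e-9

# ... and I_a is additive on product distributions.
print(np.isclose(sibson_I(np.kron(P, P), a), 2 * I_cl))  # True
```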
Correlation Detection and One-Shot Converse

Correlation detection: given a pmf $P_{XY}$, consider
  Null hypothesis: $(X, Y) \sim P_{XY}$
  Alternative hypothesis: $X \sim P_X$ independent of $Y$

For a test $T_{Z|XY}$ with $Z \in \{0,1\}$, define the errors
  $\alpha(T) = \Pr[Z = 1]$, with $(X, Y, Z) \sim P_{XY} \times T_{Z|XY}$,
  $\beta(T) = \max_{Q_Y} \Pr[Z = 0]$, with $(X, Y, Z) \sim (P_X \times Q_Y) \times T_{Z|XY}$.

The one-shot (meta-)converse can be stated in terms of this composite hypothesis testing problem (Polyanskiy '13). Any code on $W_{Y|X}$ with input distribution $P_X$ using $M$ codewords and average error $\varepsilon$ satisfies (with $P_{XY} = P_X \times W_{Y|X}$):

$M \le \frac{1}{\hat\beta(\varepsilon)}$,   $\hat\beta(\varepsilon) = \min\bigl\{ \beta(T) : T_{Z|XY} \text{ s.t. } \alpha(T) \le \varepsilon \bigr\}$.
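For intuition, the inner Neyman-Pearson quantity is easy to compute once $Q_Y$ is fixed, and the converse bound holds for each fixed $Q_Y$ (maximizing over $Q_Y$ only tightens it). A sketch with the illustrative choice $Q_Y = P_Y$ and a hypothetical $P_{XY}$:

```python
import numpy as np

def np_beta(p, q, eps):
    """Optimal (randomized) Neyman-Pearson trade-off: the smallest
    beta = Pr_q[Z = 0] over tests with alpha = Pr_p[Z = 1] <= eps."""
    ratio = p / np.maximum(q, 1e-300)
    order = np.argsort(-ratio)             # accept the null where p/q is largest
    p_acc = q_acc = 0.0
    for i in order:
        if p_acc + p[i] <= 1 - eps:
            p_acc += p[i]; q_acc += q[i]
        else:                              # randomize on the boundary outcome
            return q_acc + (1 - eps - p_acc) / p[i] * q[i]
    return q_acc

P = np.array([[0.30, 0.10],
              [0.05, 0.55]])
p = P.ravel()
q = np.outer(P.sum(1), P.sum(0)).ravel()   # fixed alternative P_X x P_Y
bh = np_beta(p, q, 0.1)
print(bh, "meta-converse: M <=", 1 / bh)
```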
Asymptotic Correlation Detection

Consider the asymptotics $n \to \infty$ for the sequence of problems
  Null hypothesis: $(X^n, Y^n) \sim P_{XY}^n$
  Alternative hypothesis: $X^n \sim P_X^n$ independent of $Y^n$ (whose law $Q_{Y^n}$ is arbitrary, not necessarily i.i.d.)

For a test $T^n_{Z|X^n Y^n}$ with $Z \in \{0,1\}$, define the errors
  $\alpha(T^n) = \Pr[Z = 1]$, with $(X^n, Y^n, Z) \sim P_{XY}^n \times T^n_{Z|X^n Y^n}$,
  $\beta(T^n) = \max_{Q_{Y^n}} \Pr[Z = 0]$, with $(X^n, Y^n, Z) \sim (P_X^n \times Q_{Y^n}) \times T^n_{Z|X^n Y^n}$.

Define the minimal error for fixed rate $R > 0$:

$\hat\alpha(R; n) = \min\bigl\{ \alpha(T^n) : T^n_{Z|X^n Y^n} \text{ s.t. } \beta(T^n) \le \exp(-nR) \bigr\}$.
Error Exponents (Hoeffding)

Recall: $I_s(X:Y) = \min_{Q_Y} D_s(P_{XY} \| P_X \times Q_Y)$ and
$\hat\alpha(R; n) = \min\bigl\{ \alpha(T^n) : T^n_{Z|X^n Y^n} \text{ s.t. } \beta(T^n) \le \exp(-nR) \bigr\}$.

Result (Error Exponent). For any $R > 0$, we have

$\lim_{n\to\infty} \Bigl\{ -\frac{1}{n} \log \hat\alpha(R; n) \Bigr\} = \sup_{s \in (0,1)} \Bigl\{ \frac{1-s}{s} \bigl( I_s(X:Y) - R \bigr) \Bigr\}$.

If $R \ge I(X:Y)$ it evaluates to 0; otherwise it is positive. $I(X:Y)$ is the critical rate (cf. Stein's lemma). If $R < I_0(X:Y)$ it diverges to $+\infty$: this is the zero-error regime.
Strong Converse Exponents (Han-Kobayashi)

Recall: $I_s(X:Y) = \min_{Q_Y} D_s(P_{XY} \| P_X \times Q_Y)$ and
$\hat\alpha(R; n) = \min\bigl\{ \alpha(T^n) : T^n_{Z|X^n Y^n} \text{ s.t. } \beta(T^n) \le \exp(-nR) \bigr\}$.

Result (Strong Converse Exponent). For any $0 < R < I_\infty(X:Y)$, we have

$\lim_{n\to\infty} \Bigl\{ -\frac{1}{n} \log \bigl( 1 - \hat\alpha(R; n) \bigr) \Bigr\} = \sup_{s > 1} \Bigl\{ \frac{s-1}{s} \bigl( R - I_s(X:Y) \bigr) \Bigr\}$.

If $R \le I(X:Y)$ it evaluates to 0; otherwise it is positive. This implies the strong converse to Stein's lemma.

What if $R = I(X:Y)$?
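Both suprema are easy to evaluate on a grid via the Sibson closed form. A sketch for a hypothetical $P_{XY}$ (the grids and ranges are arbitrary choices), checking the sign behavior around the critical rate:

```python
import numpy as np

def sibson_I(P, s):
    Px = P.sum(axis=1)
    inner = np.sum(P ** s / Px[:, None] ** (s - 1), axis=0)
    return (s / (s - 1)) * np.log(np.sum(inner ** (1 / s)))

P = np.array([[0.30, 0.10],
              [0.05, 0.55]])
Q = np.outer(P.sum(1), P.sum(0))
I = np.sum(P * np.log(P / Q))                # critical rate I(X:Y)

def hoeffding_exp(R, grid=np.linspace(0.01, 0.999, 300)):
    vals = [(1 - s) / s * (sibson_I(P, s) - R) for s in grid]
    return max(0.0, max(vals))               # the sup approaches 0 as s -> 1

def sc_exp(R, grid=np.linspace(1.001, 40.0, 300)):
    vals = [(s - 1) / s * (R - sibson_I(P, s)) for s in grid]
    return max(0.0, max(vals))

print(hoeffding_exp(0.5 * I) > 0, hoeffding_exp(1.5 * I) == 0.0)  # True True
print(sc_exp(1.5 * I) > 0, sc_exp(0.5 * I) == 0.0)                # True True
```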
Second Order Expansion

For small deviations $r$ from the rate $R$, define

$\hat\alpha(R, r; n) = \min\bigl\{ \alpha(T^n) : T^n_{Z|X^n Y^n} \text{ s.t. } \beta(T^n) \le \exp(-nR - \sqrt{n}\, r) \bigr\}$.

Result (Second Order Expansion). For any $r \in \mathbb{R}$, we have

$\lim_{n\to\infty} \hat\alpha\bigl( I(X:Y), r; n \bigr) = \Phi\Bigl( \frac{r}{\sqrt{V(X:Y)}} \Bigr)$.

$\Phi$ is the cumulative distribution function of a standard Gaussian. $V(X:Y) = V(P_{XY} \| P_X \times P_Y)$, where $V(\cdot\|\cdot)$ is the divergence variance, and

$\frac{d}{ds}\Big|_{s=1} I_s(X:Y) = \frac{1}{2} V(X:Y)$.
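The limiting expression is fully explicit. A sketch computing $V(X:Y)$ and the Gaussian limit for a hypothetical $P_{XY}$:

```python
import numpy as np
from math import erf, sqrt

P = np.array([[0.30, 0.10],
              [0.05, 0.55]])
Q = np.outer(P.sum(1), P.sum(0))          # P_X x P_Y
m = P > 0
llr = np.log(P[m] / Q[m])                 # per-symbol log-likelihood ratio
I = float(np.sum(P[m] * llr))             # mutual information I(X:Y)
V = float(np.sum(P[m] * (llr - I) ** 2))  # divergence variance V(X:Y)

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
for r in [-1.0, 0.0, 1.0]:
    print(r, Phi(r / sqrt(V)))            # limiting value of alpha-hat(I, r; n)
```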
Universal Distribution

For every $n$, consider the universal pmf (Hayashi '09)

$T^n_{Y^n}(y^n) = \sum_{\lambda \in P_n(\mathcal{Y})} \frac{1}{|P_n(\mathcal{Y})|}\, U_\lambda(y^n)$,

where $U_\lambda$ is the uniform distribution over the type class $\lambda$ and $P_n(\mathcal{Y})$ is the set of types. Every $S_n$-invariant pmf $\bar Q_{Y^n}$ satisfies

$\bar Q_{Y^n}(y^n) \le |P_n(\mathcal{Y})|\, T^n_{Y^n}(y^n)$ for all $y^n$.

Main idea: test $P_{XY}^n$ vs. $P_X^n \times T^n_{Y^n}$.

Lemma. For any joint pmf $P_{XY}$, the universal pmf satisfies

$D_\alpha\bigl( P_{XY}^n \,\|\, P_X^n \times T^n_{Y^n} \bigr) = n I_\alpha(X:Y) + O(\log n)$.
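For a binary alphabet the types are indexed by the number of ones, so $|P_n(\mathcal{Y})| = n + 1$, and both the normalization of $T^n_{Y^n}$ and the domination of i.i.d. (hence $S_n$-invariant) pmfs can be checked exhaustively for small $n$. A sketch (illustrative parameters):

```python
import numpy as np
from itertools import product
from math import comb

n, q = 6, 0.3                      # block length; Bernoulli(q) test pmf
seqs = list(product([0, 1], repeat=n))

def T_univ(y):                     # universal pmf: uniform mixture of type classes
    k = sum(y)
    return (1.0 / (n + 1)) * (1.0 / comb(n, k))

ok = all(q ** sum(y) * (1 - q) ** (n - sum(y)) <= (n + 1) * T_univ(y) + 1e-12
         for y in seqs)
print(ok)                            # every i.i.d. pmf is dominated
print(sum(T_univ(y) for y in seqs))  # normalization: sums to 1
```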
Error Exponent: Achievability (1)

Fix $s \in (0,1)$ and a sequence $\{\lambda_n\}_n$ to be chosen later. We use Neyman-Pearson tests for $P_{XY}^n$ vs. $P_X^n \times T^n_{Y^n}$:

$Z(x^n, y^n) = \mathbf{1}\Bigl\{ \log \frac{P_{XY}^n(x^n, y^n)}{P_X^n(x^n)\, T^n_{Y^n}(y^n)} \le \lambda_n \Bigr\}$.

Then, with $(X^n, Y^n) \sim P_{XY}^n$, we have

$\Pr[Z = 1] = \sum_{x^n, y^n} P_{XY}^n(x^n, y^n)\, \mathbf{1}\Bigl\{ \log \frac{P_{XY}^n(x^n, y^n)}{P_X^n(x^n)\, T^n_{Y^n}(y^n)} \le \lambda_n \Bigr\}$
$\le \sum_{x^n, y^n} \exp\bigl( (1-s)\lambda_n \bigr) \bigl( P_{XY}^n(x^n, y^n) \bigr)^s \bigl( P_X^n(x^n)\, T^n_{Y^n}(y^n) \bigr)^{1-s}$
$= \exp\Bigl( (1-s)\bigl( \lambda_n - D_s(P_{XY}^n \| P_X^n \times T^n_{Y^n}) \bigr) \Bigr)$,

where the inequality uses that on the indicated event $P_{XY}^n \le e^{\lambda_n} P_X^n T^n_{Y^n}$, so $(P_{XY}^n)^{1-s} \le e^{(1-s)\lambda_n} (P_X^n T^n_{Y^n})^{1-s}$.
Error Exponent: Achievability (2)

And, with $(X^n, Y^n) \sim P_X^n \times Q_{Y^n}$, we have

$\Pr[Z = 0] = \sum_{x^n, y^n} P_X^n(x^n)\, Q_{Y^n}(y^n)\, \mathbf{1}\Bigl\{ \log \frac{P_{XY}^n(x^n, y^n)}{P_X^n(x^n)\, T^n_{Y^n}(y^n)} > \lambda_n \Bigr\}$
$= \sum_{x^n, y^n} P_X^n(x^n)\, \bar Q_{Y^n}(y^n)\, \mathbf{1}\Bigl\{ \log \frac{P_{XY}^n(x^n, y^n)}{P_X^n(x^n)\, T^n_{Y^n}(y^n)} > \lambda_n \Bigr\}$,

where $\bar Q_{Y^n}(y^n) = \frac{1}{|S_n|} \sum_{\pi \in S_n} Q_{Y^n}(\pi(y^n))$ is $S_n$-invariant (both $P_X^n$ and the event are permutation-invariant). Now we can bring in the universal pmf again:

$\Pr[Z = 0] \le |P_n(\mathcal{Y})| \sum_{x^n, y^n} P_X^n(x^n)\, T^n_{Y^n}(y^n)\, \mathbf{1}\Bigl\{ \log \frac{P_{XY}^n(x^n, y^n)}{P_X^n(x^n)\, T^n_{Y^n}(y^n)} > \lambda_n \Bigr\}$
$\le |P_n(\mathcal{Y})| \exp\bigl( -s\lambda_n - (1-s)\, D_s(P_{XY}^n \| P_X^n \times T^n_{Y^n}) \bigr)$.

Choose $\{\lambda_n\}$ such that $\Pr[Z = 0] \le \exp(-nR)$.
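For the record, here is the rate computation left implicit in the last step (a sketch under the stated bounds). Writing $D_s^{(n)} := D_s(P_{XY}^n \| P_X^n \times T^n_{Y^n}) = n I_s(X:Y) + O(\log n)$, the bound above gives $\Pr[Z = 0] \le \exp(-nR)$ for the choice

$\lambda_n = \frac{1}{s} \Bigl( nR + \log|P_n(\mathcal{Y})| - (1-s)\, D_s^{(n)} \Bigr)$,

and plugging this into the bound on $\Pr[Z = 1]$ from the previous slide yields

$-\frac{1}{n} \log \Pr[Z = 1] \ge \frac{1-s}{n} \bigl( D_s^{(n)} - \lambda_n \bigr) = \frac{1-s}{s} \bigl( I_s(X:Y) - R \bigr) + O\Bigl( \frac{\log n}{n} \Bigr)$,

which, after optimizing over $s \in (0,1)$, matches the Hoeffding exponent stated earlier.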
Second Order: Achievability

There exists $\{\lambda_n\}_n$ such that

$\Pr[Z = 0] \le \exp\bigl( -n I(X:Y) - \sqrt{n}\, r \bigr)$ for $(X^n, Y^n) \sim P_X^n \times Q_{Y^n}$,
$\Pr[Z = 1] = \Pr\bigl[ F_n(X^n, Y^n) < r \bigr]$ for $(X^n, Y^n) \sim P_{XY}^n$,

with a new sequence of random variables

$F_n(X^n, Y^n) = \frac{1}{\sqrt{n}} \Bigl( \log \frac{P_{XY}^n(X^n, Y^n)}{P_X^n(X^n)\, T^n_{Y^n}(Y^n)} - n I(X:Y) - \log |P_n(\mathcal{Y})| \Bigr)$.

Asymptotic cumulant generating function:

$\Lambda_F(t) = \lim_{n\to\infty} \log \mathbb{E}[\exp(t F_n)] = \lim_{n\to\infty} \frac{t}{\sqrt{n}} \Bigl( D_{1 + t/\sqrt{n}}\bigl( P_{XY}^n \| P_X^n \times T^n_{Y^n} \bigr) - n I(X:Y) \Bigr) = \frac{t^2}{2} V(P_{XY} \| P_X \times P_Y)$.

$F_n$ converges in distribution to a Gaussian $F$ with variance $V$ (by a variation of Lévy's continuity theorem).
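A Monte Carlo sanity check of the Gaussian limit. As a simplification (not from the talk) it uses the i.i.d. reference $P_X^n \times P_Y^n$ instead of $P_X^n \times T^n_{Y^n}$; by the Lemma above the two differ by $O(\log n)$ in the exponent, which vanishes after dividing by $\sqrt{n}$, so the limiting variance is the same $V$:

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.30, 0.10],
              [0.05, 0.55]])
Q = np.outer(P.sum(1), P.sum(0))
llr = np.log(P / Q).ravel()                 # per-symbol log-likelihood ratio
w = P.ravel()
I = np.sum(w * llr)
V = np.sum(w * (llr - I) ** 2)

n, trials = 500, 5000
idx = rng.choice(4, size=(trials, n), p=w)  # i.i.d. samples from P_XY
F = (llr[idx].sum(axis=1) - n * I) / np.sqrt(n)
print(F.mean(), F.var(), V)                 # mean ~ 0, sample variance ~ V
```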
Quantum Hypothesis Testing

Given a bipartite quantum state $\rho_{AB}$, consider
  Null hypothesis: the state is $\rho_{AB}$
  Alternative hypothesis: the state is $\rho_A \otimes \sigma_B$ for some state $\sigma_B$

Using the same notation:

$\lim_{n\to\infty} \Bigl\{ -\frac{1}{n} \log \hat\alpha(R; n) \Bigr\} = \sup_{s \in (0,1)} \Bigl\{ \frac{1-s}{s} \bigl( \bar I_s(A:B) - R \bigr) \Bigr\}$,

$\lim_{n\to\infty} \Bigl\{ -\frac{1}{n} \log \bigl( 1 - \hat\alpha(R; n) \bigr) \Bigr\} = \sup_{s > 1} \Bigl\{ \frac{s-1}{s} \bigl( R - \tilde I_s(A:B) \bigr) \Bigr\}$.

The definitions are similar,

$\bar I_s(A:B) = \min_{\sigma_B} \bar D_s(\rho_{AB} \| \rho_A \otimes \sigma_B)$,
$\tilde I_s(A:B) = \min_{\sigma_B} \tilde D_s(\rho_{AB} \| \rho_A \otimes \sigma_B)$,

but $\bar D_s$ and $\tilde D_s$ are different!
Two Quantum Rényi Divergences

[Figure: $\bar D_s(\rho\|\sigma)$ and $\tilde D_s(\rho\|\sigma)$ plotted as functions of $s \in [0, 3]$; both curves pass through the relative entropy $D(\rho\|\sigma)$ at $s = 1$.]

$\bar D_s(\rho \| \sigma) = \frac{1}{s-1} \log \mathrm{tr}\bigl( \rho^s \sigma^{1-s} \bigr)$,
$\tilde D_s(\rho \| \sigma) = \frac{1}{s-1} \log \mathrm{tr}\Bigl( \bigl( \sigma^{\frac{1-s}{2s}}\, \rho\, \sigma^{\frac{1-s}{2s}} \bigr)^s \Bigr)$.

They agree with the classical quantity for commuting states.
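A numerical sketch of both quantities (illustrative random states; the matrix-power helper is a simple eigendecomposition, an implementation choice rather than anything from the talk):

```python
import numpy as np

def mpow(A, t):
    """Fractional power of a positive definite Hermitian matrix."""
    w, U = np.linalg.eigh(A)
    return (U * w ** t) @ U.conj().T

def petz_D(rho, sigma, s):
    return np.log(np.real(np.trace(mpow(rho, s) @ mpow(sigma, 1 - s)))) / (s - 1)

def sandwiched_D(rho, sigma, s):
    G = mpow(sigma, (1 - s) / (2 * s))
    return np.log(np.real(np.trace(mpow(G @ rho @ G, s)))) / (s - 1)

def rand_state(d, rng):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T
    return rho / np.real(np.trace(rho))

rng = np.random.default_rng(3)
rho, sigma = rand_state(3, rng), rand_state(3, rng)
for s in [0.5, 2.0]:
    # The sandwiched value never exceeds the Petz value (Araki-Lieb-Thirring).
    print(s, petz_D(rho, sigma, s), sandwiched_D(rho, sigma, s))

# For commuting (here: diagonal) states both reduce to the classical D_s.
p, q = np.diag([0.6, 0.3, 0.1]), np.diag([0.2, 0.5, 0.3])
print(petz_D(p, q, 2.0), sandwiched_D(p, q, 2.0))  # equal
```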
Summary and Outlook

Correlation detection gives an operational meaning to

$I_\alpha(X:Y) = \min_{Q_Y} D_\alpha(P_{XY} \| P_X \times Q_Y)$.

Similarly, Arimoto's conditional Rényi entropy

$H_\alpha(X|Y) = \log|\mathcal{X}| - \min_{Q_Y} D_\alpha(P_{XY} \| U_X \times Q_Y)$

has an operational interpretation:
  Null hypothesis: $(X, Y) \sim P_{XY}$
  Alternative hypothesis: $X \sim U_X$ uniform and independent of $Y$

Does the symmetric mutual information

$I'_\alpha(X:Y) = \min_{Q_X, Q_Y} D_\alpha(P_{XY} \| Q_X \times Q_Y)$

have a natural operational interpretation?
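The displayed identity for $H_\alpha(X|Y)$ can be checked numerically: a Sibson-type computation shows the minimum over $Q_Y$ equals $\log|\mathcal{X}|$ minus Arimoto's closed form $H_\alpha(X|Y) = \frac{\alpha}{1-\alpha} \log \sum_y \bigl( \sum_x P_{XY}(x,y)^\alpha \bigr)^{1/\alpha}$. A sketch with a random pmf, using a random search over $Q_Y$ in place of the exact minimization:

```python
import numpy as np

def arimoto_H(P, a):
    """Arimoto's conditional Rényi entropy H_a(X|Y), closed form, in nats."""
    return (a / (1 - a)) * np.log(np.sum(np.sum(P ** a, axis=0) ** (1 / a)))

def renyi_D(p, q, a):
    m = p > 0
    return np.log(np.sum(p[m] ** a * q[m] ** (1 - a))) / (a - 1)

rng = np.random.default_rng(4)
P = rng.random((3, 4)); P /= P.sum()      # rows: x, columns: y
a, nX = 1.8, P.shape[0]
U = np.full(nX, 1.0 / nX)                 # uniform U_X

# Random candidates Q_Y never beat log|X| - H_a(X|Y); near-optimal
# candidates approach it from above.
target = np.log(nX) - arimoto_H(P, a)
best = min(renyi_D(P.ravel(), np.outer(U, Q / Q.sum()).ravel(), a)
           for Q in rng.random((2000, 4)))
print(target, best)                       # best >= target, and close to it
```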