Minimax Redundancy for Large Alphabets by Analytic Methods


Minimax Redundancy for Large Alphabets by Analytic Methods

Wojciech Szpankowski
Purdue University, W. Lafayette, IN 47907

March 15, 2012
NSF STC Center for Science of Information
CISS, Princeton, 2012

Joint work with M. Weinberger

Outline

1. Source Coding: The Redundancy Rate Problem
2. Universal Memoryless Sources
   (a) Finite Alphabet
   (b) Unbounded Alphabet
3. Universal Renewal Sources

Source Coding and Redundancy

Source coding aims at finding codes $C : \mathcal{A}^* \to \{0,1\}^*$ of the shortest length $L(C,x)$, either on average or for individual sequences.

Known source $P$: the pointwise and maximal redundancy are
$$R_n(C_n,P;x_1^n) = L(C_n,x_1^n) + \log P(x_1^n),$$
$$R^*(C_n,P) = \max_{x_1^n}\left[L(C_n,x_1^n) + \log P(x_1^n)\right],$$
where $P(x_1^n)$ is the probability of $x_1^n = x_1\cdots x_n$.

Unknown source $P$: following Davisson, the maximal minimax redundancy $R_n^*(\mathcal{S})$ for a family of sources $\mathcal{S}$ is
$$R_n^*(\mathcal{S}) = \min_{C_n}\sup_{P\in\mathcal{S}}\max_{x_1^n}\left[L(C_n,x_1^n) + \log P(x_1^n)\right].$$

Shtarkov's Bound:
$$d_n(\mathcal{S}) := \log \underbrace{\sum_{x_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(x_1^n)}_{D_n(\mathcal{S})} \;\le\; R_n^*(\mathcal{S}) \;\le\; \log \sum_{x_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(x_1^n) \;+\; 1.$$

Maximal Minimax Redundancy $R_n^*$

For the maximal minimax redundancy define
$$Q^*(x_1^n) := \frac{\sup_{P\in\mathcal{S}} P(x_1^n)}{\sum_{y_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(y_1^n)},$$
the maximum-likelihood distribution. Observe that (Shtarkov, 1976):
$$\begin{aligned}
R_n^*(\mathcal{S}) &= \min_{C_n\in\mathcal{C}}\sup_{P\in\mathcal{S}}\max_{x_1^n}\left(L(C_n,x_1^n) + \log P(x_1^n)\right)\\
&= \min_{C_n\in\mathcal{C}}\max_{x_1^n}\left(L(C_n,x_1^n) + \sup_{P\in\mathcal{S}}\log P(x_1^n)\right)\\
&= \min_{C_n\in\mathcal{C}}\max_{x_1^n}\left(L(C_n,x_1^n) + \log Q^*(x_1^n)\right) + \log\sum_{y_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(y_1^n)\\
&= R_n^{GS}(Q^*) + \log\sum_{y_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(y_1^n)\\
&= \log\sum_{y_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(y_1^n) + O(1),
\end{aligned}$$
where $R_n^{GS}(Q^*)$ is the redundancy of the optimal generalized Shannon code. Also,
$$d_n(\mathcal{S}) = \log\sum_{x_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(x_1^n) =: \log D_n(\mathcal{S}).$$

Learnable Information and Redundancy

1. Let $\mathcal{S} := \mathcal{M}_k = \{P_\theta : \theta\in\Theta\}$ be a set of $k$-dimensional parameterized distributions, and let $\hat\theta(x^n) = \arg\max_{\theta\in\Theta} P_\theta(x^n)$ be the ML estimator.

[Figure: the parameter space $\Theta$ partitioned into $C_n(\Theta)$ balls of indistinguishable models of radius $\sim 1/\sqrt{n}$; $\log C_n(\Theta)$ useful bits.]

2. Two models, say $P_\theta(x^n)$ and $P_{\theta'}(x^n)$, are indistinguishable if the ML estimator $\hat\theta$ with high probability declares both models to be the same.

3. The number of distinguishable distributions (i.e., of distinct $\hat\theta$), $C_n(\Theta)$, summarizes the learnable information: $I(\Theta) = \log_2 C_n(\Theta)$.

4. Consider the following expansion of the Kullback-Leibler (KL) divergence:
$$D(P_{\hat\theta}\,\|\,P_\theta) := \mathbf{E}[\log P_{\hat\theta}(X^n)] - \mathbf{E}[\log P_\theta(X^n)] \approx \tfrac{1}{2}(\theta-\hat\theta)^T I(\hat\theta)(\theta-\hat\theta) =: d_I^2(\theta,\hat\theta),$$
where $I(\theta) = \{I_{ij}(\theta)\}_{ij}$ is the Fisher information matrix and $d_I(\theta,\hat\theta)$ is a rescaled Euclidean distance known as the Mahalanobis distance.

5. Balasubramanian proved that the number of distinguishable balls $C_n(\Theta)$ of radius $O(1/\sqrt{n})$ is asymptotically equal to the minimax redundancy:
$$\text{Learnable Information} = \log C_n(\Theta) = \min_{Q}\max_{x^n}\log\frac{P_{\hat\theta(x^n)}(x^n)}{Q(x^n)} = R_n^*(\mathcal{M}_k).$$
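
To make item 4 concrete, here is a minimal Python check for a one-dimensional (Bernoulli) family: the KL divergence between two nearby parameters is well approximated by the Fisher-information quadratic form $\tfrac12(\theta-\hat\theta)^2 I(\hat\theta)$. The function names and the sample values below are only illustrative.

```python
from math import log

def kl_bernoulli(p, q, n=1):
    """KL divergence D(P_p || P_q) in nats for n i.i.d. Bernoulli symbols."""
    return n * (p * log(p / q) + (1 - p) * log((1 - p) / (1 - q)))

def mahalanobis_sq(p, q, n=1):
    """Quadratic (Fisher/Mahalanobis) approximation 0.5*(q - p)^2 * I(p),
    with Fisher information I(p) = n / (p*(1-p)) for n Bernoulli symbols."""
    return 0.5 * (q - p) ** 2 * n / (p * (1 - p))

# Parameters about 1/sqrt(n) apart are barely distinguishable; the two
# quantities agree to within a few percent (about 0.185 vs 0.188 here).
p_hat, p, n = 0.40, 0.43, 100
print(kl_bernoulli(p_hat, p, n), mahalanobis_sq(p_hat, p, n))
```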

Outline Update

1. Source Coding: The Redundancy Rate Problem
2. Universal Memoryless Sources
   (a) Finite Alphabet
   (b) Unbounded Alphabet
3. Universal Renewal Sources

Maximal Minimax for Memoryless Sources

For a memoryless source over the alphabet $\mathcal{A} = \{1,2,\ldots,m\}$ we have
$$P(x_1^n) = p_1^{k_1}\cdots p_m^{k_m}, \qquad k_1+\cdots+k_m = n.$$
Then
$$\begin{aligned}
D_n(\mathcal{M}_0) &:= \sum_{x_1^n}\sup_P P(x_1^n) = \sum_{x_1^n}\sup_{p_1,\ldots,p_m} p_1^{k_1}\cdots p_m^{k_m}\\
&= \sum_{k_1+\cdots+k_m=n}\binom{n}{k_1,\ldots,k_m}\sup_{p_1,\ldots,p_m} p_1^{k_1}\cdots p_m^{k_m}\\
&= \sum_{k_1+\cdots+k_m=n}\binom{n}{k_1,\ldots,k_m}\left(\frac{k_1}{n}\right)^{k_1}\cdots\left(\frac{k_m}{n}\right)^{k_m},
\end{aligned}$$
since the (unnormalized) maximum-likelihood value is
$$\sup_P P(x_1^n) = \sup_{p_1,\ldots,p_m} p_1^{k_1}\cdots p_m^{k_m} = \left(\frac{k_1}{n}\right)^{k_1}\cdots\left(\frac{k_m}{n}\right)^{k_m}.$$
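
A minimal Python sketch of this sum, enumerating all type classes $(k_1,\ldots,k_m)$ directly. The helper name `D_memoryless` and the small parameters are just for illustration; the enumeration is exponential in $m$, so it is meant only for small $n$ and $m$.

```python
from math import factorial, log2

def D_memoryless(n, m):
    """Shtarkov sum D_n(M_0) for a memoryless source over an m-ary alphabet:
    sum over type classes of multinomial(n; k_1..k_m) * prod (k_i/n)^{k_i}."""
    def compositions(total, parts):
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    total = 0.0
    for ks in compositions(n, m):
        multinom = factorial(n)
        for k in ks:
            multinom //= factorial(k)
        lik = 1.0
        for k in ks:
            if k:
                lik *= (k / n) ** k
        total += multinom * lik
    return total

# Example: maximal minimax redundancy (in bits) for a binary source, n = 10;
# prints roughly 2.2 bits.
print(log2(D_memoryless(10, 2)))
```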

Generating Function for $D_n(\mathcal{M}_0)$

We write
$$D_n(\mathcal{M}_0) = \sum_{k_1+\cdots+k_m=n}\binom{n}{k_1,\ldots,k_m}\left(\frac{k_1}{n}\right)^{k_1}\cdots\left(\frac{k_m}{n}\right)^{k_m} = \frac{n!}{n^n}\sum_{k_1+\cdots+k_m=n}\frac{k_1^{k_1}}{k_1!}\cdots\frac{k_m^{k_m}}{k_m!}.$$

Let us introduce the tree-generating function
$$B(z) = \sum_{k\ge 0}\frac{k^k}{k!}z^k = \frac{1}{1-T(z)}, \qquad T(z) = \sum_{k\ge 1}\frac{k^{k-1}}{k!}z^k,$$
where $T(z) = z e^{T(z)}$ (so $T(z) = -W(-z)$, with $W$ Lambert's $W$-function) enumerates all rooted labeled trees.

Let now
$$D_m(z) = \sum_{n\ge 0}\frac{n^n}{n!}\, D_n(\mathcal{M}_0)\, z^n.$$
Then, by the convolution formula,
$$D_m(z) = [B(z)]^m.$$
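
A quick sanity check of the convolution identity: compare $\frac{n!}{n^n}[z^n]B(z)^m$, computed from the truncated series of $B$, with the direct sum computed by `D_memoryless` above. This is an illustrative sketch; exact rationals are used so the only rounding is the final float conversion.

```python
from fractions import Fraction
from math import factorial

N = 8   # truncation order
m = 3   # alphabet size

# Truncated series of the tree function B(z) = sum_{k>=0} k^k/k! z^k (0^0 = 1).
B = [Fraction(k**k if k else 1, factorial(k)) for k in range(N + 1)]

def series_pow(coeffs, power, order):
    """Coefficients of coeffs(z)**power, truncated at z^order."""
    result = [Fraction(1)] + [Fraction(0)] * order
    for _ in range(power):
        new = [Fraction(0)] * (order + 1)
        for i, a in enumerate(result):
            for j, b in enumerate(coeffs):
                if i + j <= order:
                    new[i + j] += a * b
        result = new
    return result

Bm = series_pow(B, m, N)

# n!/n^n * [z^n] B(z)^m should equal the direct Shtarkov sum D_memoryless(n, m).
n = 6
print(float(Bm[n] * factorial(n) / n**n))   # compare with D_memoryless(6, 3)
```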

Asymptotics for FINITE $m$

The function $B(z)$ has an algebraic singularity at $z = e^{-1}$, and
$$\beta(z) := B(z/e) = \frac{1}{\sqrt{2(1-z)}} + \frac{1}{3} + O\!\left(\sqrt{1-z}\right).$$
By Cauchy's coefficient formula,
$$D_n(\mathcal{M}_0) = \frac{n!}{n^n}[z^n][B(z)]^m = \sqrt{2\pi n}\,(1+O(1/n))\,\frac{1}{2\pi i}\oint \frac{\beta(z)^m}{z^{n+1}}\,dz.$$

For finite $m$, the singularity analysis of Flajolet and Odlyzko,
$$[z^n](1-z)^{-\alpha} \sim \frac{n^{\alpha-1}}{\Gamma(\alpha)}, \qquad \alpha\notin\{0,-1,-2,\ldots\},$$
finally yields (cf. Clarke & Barron, 1990, and W.S., 1998)
$$R_n^*(\mathcal{M}_0) = \frac{m-1}{2}\log\frac{n}{2} + \log\frac{\sqrt{\pi}}{\Gamma(m/2)} + \frac{\Gamma(m/2)\,m}{3\,\Gamma\!\left(\frac{m}{2}-\frac12\right)}\cdot\frac{\sqrt{2}}{\sqrt{n}} + \left(\frac{3+m(m-2)(2m+1)}{36} - \frac{\Gamma^2(m/2)\,m^2}{9\,\Gamma^2\!\left(\frac{m}{2}-\frac12\right)}\right)\frac{1}{n} + \cdots$$
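
As a hedged numerical illustration, the two leading terms above can be compared with the exact Shtarkov sum; `redundancy_asymptotic` below is an assumed helper name and keeps only the first two terms.

```python
from math import lgamma, log, log2, pi, sqrt

def redundancy_asymptotic(n, m):
    """Two leading terms of R*_n(M_0) in bits for fixed m:
       (m-1)/2 * log2(n/2) + log2(sqrt(pi)/Gamma(m/2))."""
    return (m - 1) / 2 * log2(n / 2) + (log(sqrt(pi)) - lgamma(m / 2)) / log(2)

# For n = 20, m = 3 this differs from the exact value log2(D_memoryless(20, 3))
# only by the lower-order O(1/sqrt(n)) corrections (a fraction of a bit).
print(redundancy_asymptotic(20, 3))
```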

Outline Update

1. Source Coding: The Redundancy Rate Problem
2. Universal Memoryless Sources
   (a) Finite Alphabet
   (b) Unbounded Alphabet
3. Universal Renewal Sources

Redundancy for LARGE $m$

Now assume that $m$ is unbounded and may vary with $n$. Then
$$D_{n,m}(\mathcal{M}_0) = \sqrt{2\pi n}\,\frac{1}{2\pi i}\oint \frac{\beta(z)^m}{z^{n+1}}\,dz = \sqrt{2\pi n}\,\frac{1}{2\pi i}\oint e^{g(z)}\,dz,$$
where $g(z) = m\ln\beta(z) - (n+1)\ln z$.

The saddle point $z_0$ is a solution of $g'(z_0) = 0$, and around it
$$g(z) = g(z_0) + \tfrac{1}{2}(z-z_0)^2 g''(z_0) + O\!\left(g'''(z_0)(z-z_0)^3\right).$$
Under mild conditions satisfied by our $g(z)$ (e.g., $z_0$ real and unique), the saddle point method leads to
$$D_{n,m}(\mathcal{M}_0) = \sqrt{2\pi n}\,\frac{e^{g(z_0)}}{\sqrt{2\pi\, g''(z_0)}}\left(1 + O\!\left(\frac{g'''(z_0)}{(g''(z_0))^{\rho}}\right)\right)$$
for some $\rho < 3/2$.
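
A small numerical sketch of this saddle-point evaluation, using $T(z) = -W(-z)$ (Lambert $W$, via `scipy.special.lambertw`) to evaluate $\beta(z)$ and a simple finite-difference root search for $g'(z_0)=0$; the helper names and step sizes are illustrative assumptions, not part of the talk.

```python
import numpy as np
from scipy.special import lambertw
from scipy.optimize import brentq

def beta(z):
    """beta(z) = B(z/e) = 1/(1 - T(z/e)), with T(w) = -W(-w) (Lambert W)."""
    return 1.0 / (1.0 + lambertw(-z / np.e).real)

def g(z, n, m):
    return m * np.log(beta(z)) - (n + 1) * np.log(z)

def saddle_point(n, m):
    """Solve g'(z0) = 0 on (0,1) using a central-difference derivative."""
    h = 1e-7
    dg = lambda z: (g(z + h, n, m) - g(z - h, n, m)) / (2 * h)
    return brentq(dg, 1e-6, 1 - 1e-6)

# Example: m = n = 100; leading saddle-point estimate of log2 D_{n,m}.
n = m = 100
z0 = saddle_point(n, m)
h = 1e-4
g2 = (g(z0 + h, n, m) - 2 * g(z0, n, m) + g(z0 - h, n, m)) / h**2   # g''(z0)
logD = 0.5 * np.log(2 * np.pi * n) + g(z0, n, m) - 0.5 * np.log(2 * np.pi * g2)
print(z0, logD / np.log(2))   # redundancy in bits, roughly 1.2 bits/symbol for m = n
```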

Saddle Points

[Figure: location of the saddle point $z_0$ in the three regimes $m = o(n)$, $m = n$, and $n = o(m)$.]

Main Results: Large Alphabet

Theorem 1 (Orlitsky and Santhanam, 2004, and W.S. and Weinberger, 2010). For memoryless sources $\mathcal{M}_0$ over an $m$-ary alphabet, with $m\to\infty$ as $n$ grows, we have:

(i) For $m = o(n)$,
$$R_{n,m}^*(\mathcal{M}_0) = \frac{m-1}{2}\log\frac{n}{m} + \frac{m}{2}\log e + \frac{m\log e}{3}\sqrt{\frac{m}{n}} - \frac{1}{2} + O\!\left(\sqrt{\frac{m}{n}}\right).$$

(ii) For $m = \alpha n + \ell(n)$, where $\alpha$ is a positive constant and $\ell(n) = o(n)$,
$$R_{n,m}^*(\mathcal{M}_0) = n\log B_\alpha + \ell(n)\log C_\alpha - \log\sqrt{A_\alpha} + O\!\left(\ell(n)^2/n\right),$$
where $C_\alpha := \tfrac12 + \tfrac12\sqrt{1+4/\alpha}$, $A_\alpha := C_\alpha + 2/\alpha$, and $B_\alpha := \alpha\, C_\alpha^{\alpha+2}\, e^{-1/C_\alpha}$.

(iii) For $n = o(m)$,
$$R_{n,m}^*(\mathcal{M}_0) = n\log\frac{m}{n} + \frac{3n^2}{2m}\log e - \frac{3n}{2m}\log e + O\!\left(\frac{1}{n} + \frac{n^3}{m^2}\right).$$
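
The constants in part (ii) are easy to evaluate numerically; the sketch below (the function name `rate_constants` is just illustrative) shows that for $m = n$ the redundancy grows at roughly 1.2 bits per symbol.

```python
from math import exp, log2, sqrt

def rate_constants(alpha):
    """Constants C_alpha, A_alpha, B_alpha of Theorem 1(ii) for m = alpha*n."""
    C = 0.5 + 0.5 * sqrt(1 + 4 / alpha)
    A = C + 2 / alpha
    B = alpha * C ** (alpha + 2) * exp(-1 / C)
    return C, A, B

# For m = n (alpha = 1), C_1 is the golden ratio and the leading rate is
# n * log2(B_1), i.e. about 1.19 bits per symbol.
C1, A1, B1 = rate_constants(1.0)
print(C1, log2(B1))
```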

Constrained Memoryless Sources

Consider the following generalized source $\widetilde{\mathcal{M}}_0$: the alphabet is $\mathcal{A}\cup\mathcal{B}$, where $|\mathcal{A}| = m$ and $|\mathcal{B}| = M$; the probabilities $p_1,\ldots,p_m$ of $\mathcal{A}$ are unknown, while the probabilities $q_1,\ldots,q_M$ of $\mathcal{B}$ are fixed; define $q = q_1+\cdots+q_M$ and $p = 1-q$.

Our goal is to compute the minimax redundancy $R_{n,m,M}^*(\widetilde{\mathcal{M}}_0)$. Recall that $R_{n,m,M}^*(\widetilde{\mathcal{M}}_0) = \log D_{n,m,M} + O(1)$, where $D_{n,m,M} = 2^{d_{n,m,M}}$.

Lemma 1.
$$D_{n,m,M} = \sum_{k=0}^{n}\binom{n}{k} p^k (1-p)^{n-k} D_{k,m_k},$$
where $\log D_{k,m_k} = R_{k,m_k}^*(\mathcal{M}_0) + O(1)$ is the minimax redundancy over $\mathcal{A}$ as presented in Theorem 1.

Proof. It basically follows from $D_{n,m,M} = \sum_{x\in(\mathcal{A}\cup\mathcal{B})^n}\sup_P P(x)$, splitting each sequence according to which $k$ of its positions carry $\mathcal{A}$-symbols (whose probabilities are optimized) and summing the fixed $\mathcal{B}$-probabilities over the remaining positions.

Binomial Sums

Consider the following binomial sum:
$$S_f(n) = \sum_{k=0}^{n}\binom{n}{k} p^k (1-p)^{n-k} f(k).$$
In our case $f(k) = D_{k,m_k}$, so that $D_{n,m,M} = S_f(n)$.

Case $f(k) = O(n^a\log^b n) = o(e^n)$:
$$S_f(n) = f(np)\,(1 + O(1/n))$$
(cf. Jacquet & W.S., 1999, and Flajolet, 1999).

Case $f(k) = \alpha^k$:
$$S_f(n) = (p\alpha + 1 - p)^n.$$

Case $f(k) = O(k^{\beta k})$:
$$S_f(n) \sim p^n f(n).$$
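
A short Python sketch of the first two cases for generic $f$ (the helper name `binomial_sum` and the chosen test functions are illustrative):

```python
from math import comb

def binomial_sum(n, p, f):
    """S_f(n) = sum_k C(n,k) p^k (1-p)^(n-k) f(k)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * f(k) for k in range(n + 1))

n, p = 1000, 0.3

# Polynomially growing f concentrates around the mean: S_f(n) ~ f(np).
f_poly = lambda k: k**2 * (1 + k) ** 0.5
print(binomial_sum(n, p, f_poly) / f_poly(n * p))   # close to 1, i.e. 1 + O(1/n)

# Exponential f reproduces the binomial theorem exactly.
alpha = 1.01
print(binomial_sum(n, p, lambda k: alpha**k) / (p * alpha + 1 - p) ** n)  # ~1.0
```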

Main Results for the Constrained Model $\widetilde{\mathcal{M}}_0$

Theorem 2 (W.S. and Weinberger, 2010). Write $m_n$ for $m$ depending on $n$.

(i) $m_n = o(n)$. Assume: (a) $m(x) := m_x$ and its derivatives are continuous functions; (b) $\Delta_n := m_{n+1} - m_n = O(m'(n))$, $m'(n) = O(m/n)$, $m''(n) = O(m/n^2)$. If $m_n = o(\sqrt{n/\log n})$, then
$$R_{n,m,M}^* = \frac{m_{np}-1}{2}\log\frac{np}{m_{np}} + \frac{m_{np}}{2}\log e - \frac{1}{2} + O\!\left(\frac{1}{m_n} + \frac{m_n^2}{n}\log^2 n\right).$$
Otherwise,
$$R_{n,m,M}^* = \frac{m_{np}}{2}\log\frac{np}{m_{np}} + \frac{m_{np}}{2}\log e + O\!\left(\log n + \frac{m_n^2}{n}\log^2\frac{n}{m_n}\right).$$

(ii) Let $m_n = \alpha n + \ell(n)$, where $\alpha$ is a positive constant and $\ell(n) = o(n)$. Then
$$R_{n,m,M}^* = n\log\left(B_\alpha\, p + 1 - p\right) - \log\sqrt{A_\alpha} + O\!\left(\frac{\ell(n)+1}{n}\right),$$
where $A_\alpha$ and $B_\alpha$ are defined in Theorem 1.

(iii) Let $n = o(m_n)$ and assume $m_k/k$ is a nondecreasing sequence. Then
$$R_{n,m,M}^* = n\log\frac{p\, m_n}{n} + O\!\left(\frac{n^2}{m_n} + \frac{1}{n}\right).$$

Outline Update

1. Source Coding: The Redundancy Rate Problem
2. Universal Memoryless Sources
3. Universal Renewal Sources

Renewal Sources (Virtual Large Alphabet)

The renewal process $\mathcal{R}_0$ (introduced in 1996 by Csiszár and Shields) is defined as follows: let $T_1, T_2, \ldots$ be a sequence of i.i.d. positive-valued random variables with distribution $Q(j) = \Pr\{T_i = j\}$. In a binary renewal sequence the positions of the 1s are at the renewal epochs $T_0, T_0+T_1, \ldots$, with runs of zeros of lengths $T_1-1, T_2-1, \ldots$.

For a sequence
$$x_1^n = 1\,0^{\alpha_1}\,1\,0^{\alpha_2}\,\cdots\,1\,0^{\alpha_k}\,\underbrace{0\cdots 0}_{k^*}$$
define $k_m$ as the number of $i$ such that $\alpha_i = m$. Then
$$P(x_1^n) = [Q(0)]^{k_0}[Q(1)]^{k_1}\cdots[Q(n-1)]^{k_{n-1}}\,\Pr\{T_1 > k^*\}.$$

Theorem 3 (Flajolet and W.S., 1998). Consider the class of renewal processes. Then
$$R_n^*(\mathcal{R}_0) = \frac{2}{\log 2}\sqrt{cn} + O(\log n), \qquad c = \frac{\pi^2}{6} - 1 \approx 0.645.$$
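
For a rough numerical feel of Theorem 3's leading term (a small sketch; the helper name is illustrative):

```python
from math import log, pi, sqrt

# Leading term of Theorem 3: R*_n(R_0) ~ (2/log 2) * sqrt(c*n), c = pi^2/6 - 1.
c = pi**2 / 6 - 1   # ~ 0.645

def renewal_redundancy(n):
    return 2 / log(2) * sqrt(c * n)

print(c, renewal_redundancy(10**6))   # about 0.645 and about 2.3e3 bits for n = 10^6
```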

Maximal Minimax Redundancy for Renewal Sources

It can be proved that
$$r_{n+1} - 1 \;\le\; D_n(\mathcal{R}_0) \;\le\; \sum_{m=0}^{n} r_m,$$
where
$$r_n = \sum_{k=0}^{n} r_{n,k}, \qquad r_{n,k} = \sum_{\mathcal{I}(n,k)}\binom{k}{k_0,\ldots,k_{n-1}}\left(\frac{k_0}{k}\right)^{k_0}\left(\frac{k_1}{k}\right)^{k_1}\cdots\left(\frac{k_{n-1}}{k}\right)^{k_{n-1}}$$
and $\mathcal{I}(n,k)$ is the set of integer partitions of $n$ into $k$ terms, i.e.,
$$n = k_0 + 2k_1 + \cdots + nk_{n-1}, \qquad k = k_0 + \cdots + k_{n-1}.$$

But we shall study
$$s_n = \sum_{k=0}^{n} s_{n,k}, \qquad s_{n,k} = e^{-k}\sum_{\mathcal{I}(n,k)}\frac{k_0^{k_0}}{k_0!}\cdots\frac{k_{n-1}^{k_{n-1}}}{k_{n-1}!},$$
since
$$S(z,u) = \sum_{k,n} s_{n,k}\, u^k z^n = \prod_{i=1}^{\infty}\beta(z^i u).$$
Then
$$s_n = [z^n]S(z,1) \approx [z^n]\exp\!\left(\frac{c}{1-z} + a\log\frac{1}{1-z}\right).$$

Theorem 4 (Flajolet and W.S., 1998). We have the following asymptotics:
$$s_n \sim \exp\!\left(2\sqrt{cn} - \frac{7}{8}\log n + O(1)\right), \qquad \log r_n = \frac{2}{\log 2}\sqrt{cn} - \frac{5}{8}\log n + \frac{1}{2}\log\log n + O(1).$$

That s IT