Minimax Redundancy for Large Alphabets by Analytic Methods


Minimax Redundancy for Large Alphabets by Analytic Methods

Wojciech Szpankowski
Purdue University, W. Lafayette, IN 47907

March 15, 2012
NSF STC Center for Science of Information
CISS, Princeton, 2012

Joint work with M. Weinberger

Outline

1. Source Coding: The Redundancy Rate Problem
2. Universal Memoryless Sources
   (a) Finite Alphabet
   (b) Unbounded Alphabet
3. Universal Renewal Sources

Source Coding and Redundancy

Source coding aims at finding codes $C : \mathcal{A}^* \to \{0,1\}^*$ of the shortest length $L(C,x)$, either on average or for individual sequences.

Known source $P$: the pointwise and maximal redundancy are
$$R_n(C_n,P;x_1^n) = L(C_n,x_1^n) + \log P(x_1^n),$$
$$R^*(C_n,P) = \max_{x_1^n}\left[L(C_n,x_1^n) + \log P(x_1^n)\right],$$
where $P(x_1^n)$ is the probability of $x_1^n = x_1\cdots x_n$.

Unknown source $P$: following Davisson, the maximal minimax redundancy $R_n^*(\mathcal{S})$ for a family of sources $\mathcal{S}$ is
$$R_n^*(\mathcal{S}) = \min_{C_n}\sup_{P\in\mathcal{S}}\max_{x_1^n}\left[L(C_n,x_1^n) + \log P(x_1^n)\right].$$

Shtarkov's Bound:
$$d_n(\mathcal{S}) := \log \underbrace{\sum_{x_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(x_1^n)}_{D_n(\mathcal{S})} \;\le\; R_n^*(\mathcal{S}) \;\le\; \log \sum_{x_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(x_1^n) \;+\; 1.$$

Maximal Minimax Redundancy $R_n^*$

For the maximal minimax redundancy define
$$Q^*(x_1^n) := \frac{\sup_{P\in\mathcal{S}} P(x_1^n)}{\sum_{y_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(y_1^n)},$$
the maximum-likelihood distribution. Observe that (Shtarkov, 1976):
$$\begin{aligned}
R_n^*(\mathcal{S}) &= \min_{C_n\in\mathcal{C}}\sup_{P\in\mathcal{S}}\max_{x_1^n}\left(L(C_n,x_1^n) + \log P(x_1^n)\right)\\
&= \min_{C_n\in\mathcal{C}}\max_{x_1^n}\left(L(C_n,x_1^n) + \sup_{P\in\mathcal{S}}\log P(x_1^n)\right)\\
&= \min_{C_n\in\mathcal{C}}\max_{x_1^n}\left(L(C_n,x_1^n) + \log Q^*(x_1^n)\right) + \log\sum_{y_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(y_1^n)\\
&= R_n^{GS}(Q^*) + \log\sum_{y_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(y_1^n)\\
&= \log\sum_{y_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(y_1^n) + O(1),
\end{aligned}$$
where $R_n^{GS}(Q^*)$ is the redundancy of the optimal generalized Shannon code. Also,
$$d_n(\mathcal{S}) = \log\sum_{x_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(x_1^n) =: \log D_n(\mathcal{S}).$$

Learnable Information and Redundancy

1. Let $\mathcal{S} := \mathcal{M}_k = \{P_\theta : \theta\in\Theta\}$ be a set of $k$-dimensional parameterized distributions, and let $\hat\theta(x^n) = \arg\max_{\theta\in\Theta} P_\theta(x^n)$ be the ML estimator.

[Figure: the parameter space $\Theta$ partitioned into $C_n(\Theta)$ balls of indistinguishable models of radius $\sim 1/\sqrt{n}$; $\log C_n(\Theta)$ useful bits.]

2. Two models, say $P_\theta(x^n)$ and $P_{\theta'}(x^n)$, are indistinguishable if the ML estimator $\hat\theta$ with high probability declares both models to be the same.

3. The number of distinguishable distributions (i.e., of distinct $\hat\theta$), $C_n(\Theta)$, summarizes the learnable information: $I(\Theta) = \log_2 C_n(\Theta)$.

4. Consider the following expansion of the Kullback-Leibler (KL) divergence:
$$D(P_{\hat\theta}\,\|\,P_\theta) := \mathbf{E}[\log P_{\hat\theta}(X^n)] - \mathbf{E}[\log P_\theta(X^n)] \approx \tfrac{1}{2}(\theta-\hat\theta)^T I(\hat\theta)(\theta-\hat\theta) =: d_I^2(\theta,\hat\theta),$$
where $I(\theta) = \{I_{ij}(\theta)\}_{ij}$ is the Fisher information matrix and $d_I(\theta,\hat\theta)$ is a rescaled Euclidean distance known as the Mahalanobis distance.

5. Balasubramanian proved that the number of distinguishable balls $C_n(\Theta)$ of radius $O(1/\sqrt{n})$ is asymptotically equal to the minimax redundancy:
$$\text{Learnable Information} = \log C_n(\Theta) = \min_{Q}\max_{x^n}\log\frac{P_{\hat\theta(x^n)}(x^n)}{Q(x^n)} = R_n^*(\mathcal{M}_k).$$
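
To make item 4 concrete, here is a minimal Python check for a one-dimensional (Bernoulli) family: the KL divergence between two nearby parameters is well approximated by the Fisher-information quadratic form $\tfrac12(\theta-\hat\theta)^2 I(\hat\theta)$. The function names and the sample values below are only illustrative.

```python
from math import log

def kl_bernoulli(p, q, n=1):
    """KL divergence D(P_p || P_q) in nats for n i.i.d. Bernoulli symbols."""
    return n * (p * log(p / q) + (1 - p) * log((1 - p) / (1 - q)))

def mahalanobis_sq(p, q, n=1):
    """Quadratic (Fisher/Mahalanobis) approximation 0.5*(q - p)^2 * I(p),
    with Fisher information I(p) = n / (p*(1-p)) for n Bernoulli symbols."""
    return 0.5 * (q - p) ** 2 * n / (p * (1 - p))

# Parameters about 1/sqrt(n) apart are barely distinguishable; the two
# quantities agree to within a few percent (about 0.185 vs 0.188 here).
p_hat, p, n = 0.40, 0.43, 100
print(kl_bernoulli(p_hat, p, n), mahalanobis_sq(p_hat, p, n))
```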

Outline Update

1. Source Coding: The Redundancy Rate Problem
2. Universal Memoryless Sources
   (a) Finite Alphabet
   (b) Unbounded Alphabet
3. Universal Renewal Sources

Maximal Minimax for Memoryless Sources

For a memoryless source over the alphabet $\mathcal{A} = \{1,2,\ldots,m\}$ we have
$$P(x_1^n) = p_1^{k_1}\cdots p_m^{k_m}, \qquad k_1+\cdots+k_m = n.$$
Then
$$\begin{aligned}
D_n(\mathcal{M}_0) &:= \sum_{x_1^n}\sup_P P(x_1^n) = \sum_{x_1^n}\sup_{p_1,\ldots,p_m} p_1^{k_1}\cdots p_m^{k_m}\\
&= \sum_{k_1+\cdots+k_m=n}\binom{n}{k_1,\ldots,k_m}\sup_{p_1,\ldots,p_m} p_1^{k_1}\cdots p_m^{k_m}\\
&= \sum_{k_1+\cdots+k_m=n}\binom{n}{k_1,\ldots,k_m}\left(\frac{k_1}{n}\right)^{k_1}\cdots\left(\frac{k_m}{n}\right)^{k_m},
\end{aligned}$$
since the (unnormalized) maximum-likelihood value is
$$\sup_P P(x_1^n) = \sup_{p_1,\ldots,p_m} p_1^{k_1}\cdots p_m^{k_m} = \left(\frac{k_1}{n}\right)^{k_1}\cdots\left(\frac{k_m}{n}\right)^{k_m}.$$
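
A minimal Python sketch of this sum, enumerating all type classes $(k_1,\ldots,k_m)$ directly. The helper name `D_memoryless` and the small parameters are just for illustration; the enumeration is exponential in $m$, so it is meant only for small $n$ and $m$.

```python
from math import factorial, log2

def D_memoryless(n, m):
    """Shtarkov sum D_n(M_0) for a memoryless source over an m-ary alphabet:
    sum over type classes of multinomial(n; k_1..k_m) * prod (k_i/n)^{k_i}."""
    def compositions(total, parts):
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    total = 0.0
    for ks in compositions(n, m):
        multinom = factorial(n)
        for k in ks:
            multinom //= factorial(k)
        lik = 1.0
        for k in ks:
            if k:
                lik *= (k / n) ** k
        total += multinom * lik
    return total

# Example: maximal minimax redundancy (in bits) for a binary source, n = 10;
# prints roughly 2.2 bits.
print(log2(D_memoryless(10, 2)))
```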

Generating Function for $D_n(\mathcal{M}_0)$

We write
$$D_n(\mathcal{M}_0) = \sum_{k_1+\cdots+k_m=n}\binom{n}{k_1,\ldots,k_m}\left(\frac{k_1}{n}\right)^{k_1}\cdots\left(\frac{k_m}{n}\right)^{k_m} = \frac{n!}{n^n}\sum_{k_1+\cdots+k_m=n}\frac{k_1^{k_1}}{k_1!}\cdots\frac{k_m^{k_m}}{k_m!}.$$

Let us introduce the tree-generating function
$$B(z) = \sum_{k\ge 0}\frac{k^k}{k!}z^k = \frac{1}{1-T(z)}, \qquad T(z) = \sum_{k\ge 1}\frac{k^{k-1}}{k!}z^k,$$
where $T(z) = z e^{T(z)}$ (so $T(z) = -W(-z)$, with $W$ Lambert's $W$-function) enumerates all rooted labeled trees.

Let now
$$D_m(z) = \sum_{n\ge 0}\frac{n^n}{n!}\, D_n(\mathcal{M}_0)\, z^n.$$
Then, by the convolution formula,
$$D_m(z) = [B(z)]^m.$$
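
A quick sanity check of the convolution identity: compare $\frac{n!}{n^n}[z^n]B(z)^m$, computed from the truncated series of $B$, with the direct sum computed by `D_memoryless` above. This is an illustrative sketch; exact rationals are used so the only rounding is the final float conversion.

```python
from fractions import Fraction
from math import factorial

N = 8   # truncation order
m = 3   # alphabet size

# Truncated series of the tree function B(z) = sum_{k>=0} k^k/k! z^k (0^0 = 1).
B = [Fraction(k**k if k else 1, factorial(k)) for k in range(N + 1)]

def series_pow(coeffs, power, order):
    """Coefficients of coeffs(z)**power, truncated at z^order."""
    result = [Fraction(1)] + [Fraction(0)] * order
    for _ in range(power):
        new = [Fraction(0)] * (order + 1)
        for i, a in enumerate(result):
            for j, b in enumerate(coeffs):
                if i + j <= order:
                    new[i + j] += a * b
        result = new
    return result

Bm = series_pow(B, m, N)

# n!/n^n * [z^n] B(z)^m should equal the direct Shtarkov sum D_memoryless(n, m).
n = 6
print(float(Bm[n] * factorial(n) / n**n))   # compare with D_memoryless(6, 3)
```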

Asymptotics for FINITE $m$

The function $B(z)$ has an algebraic singularity at $z = e^{-1}$, and
$$\beta(z) := B(z/e) = \frac{1}{\sqrt{2(1-z)}} + \frac{1}{3} + O\!\left(\sqrt{1-z}\right).$$
By Cauchy's coefficient formula,
$$D_n(\mathcal{M}_0) = \frac{n!}{n^n}[z^n][B(z)]^m = \sqrt{2\pi n}\,(1+O(1/n))\,\frac{1}{2\pi i}\oint \frac{\beta(z)^m}{z^{n+1}}\,dz.$$

For finite $m$, the singularity analysis of Flajolet and Odlyzko,
$$[z^n](1-z)^{-\alpha} \sim \frac{n^{\alpha-1}}{\Gamma(\alpha)}, \qquad \alpha\notin\{0,-1,-2,\ldots\},$$
finally yields (cf. Clarke & Barron, 1990, and W.S., 1998)
$$R_n^*(\mathcal{M}_0) = \frac{m-1}{2}\log\frac{n}{2} + \log\frac{\sqrt{\pi}}{\Gamma(m/2)} + \frac{\Gamma(m/2)\,m}{3\,\Gamma\!\left(\frac{m}{2}-\frac12\right)}\cdot\frac{\sqrt{2}}{\sqrt{n}} + \left(\frac{3+m(m-2)(2m+1)}{36} - \frac{\Gamma^2(m/2)\,m^2}{9\,\Gamma^2\!\left(\frac{m}{2}-\frac12\right)}\right)\frac{1}{n} + \cdots$$
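
As a hedged numerical illustration, the two leading terms above can be compared with the exact Shtarkov sum; `redundancy_asymptotic` below is an assumed helper name and keeps only the first two terms.

```python
from math import lgamma, log, log2, pi, sqrt

def redundancy_asymptotic(n, m):
    """Two leading terms of R*_n(M_0) in bits for fixed m:
       (m-1)/2 * log2(n/2) + log2(sqrt(pi)/Gamma(m/2))."""
    return (m - 1) / 2 * log2(n / 2) + (log(sqrt(pi)) - lgamma(m / 2)) / log(2)

# For n = 20, m = 3 this differs from the exact value log2(D_memoryless(20, 3))
# only by the lower-order O(1/sqrt(n)) corrections (a fraction of a bit).
print(redundancy_asymptotic(20, 3))
```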

Outline Update

1. Source Coding: The Redundancy Rate Problem
2. Universal Memoryless Sources
   (a) Finite Alphabet
   (b) Unbounded Alphabet
3. Universal Renewal Sources

Redundancy for LARGE $m$

Now assume that $m$ is unbounded and may vary with $n$. Then
$$D_{n,m}(\mathcal{M}_0) = \sqrt{2\pi n}\,\frac{1}{2\pi i}\oint \frac{\beta(z)^m}{z^{n+1}}\,dz = \sqrt{2\pi n}\,\frac{1}{2\pi i}\oint e^{g(z)}\,dz,$$
where $g(z) = m\ln\beta(z) - (n+1)\ln z$.

The saddle point $z_0$ is a solution of $g'(z_0) = 0$, and around it
$$g(z) = g(z_0) + \tfrac{1}{2}(z-z_0)^2 g''(z_0) + O\!\left(g'''(z_0)(z-z_0)^3\right).$$
Under mild conditions satisfied by our $g(z)$ (e.g., $z_0$ real and unique), the saddle point method leads to
$$D_{n,m}(\mathcal{M}_0) = \sqrt{2\pi n}\,\frac{e^{g(z_0)}}{\sqrt{2\pi\, g''(z_0)}}\left(1 + O\!\left(\frac{g'''(z_0)}{(g''(z_0))^{\rho}}\right)\right)$$
for some $\rho < 3/2$.
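
A small numerical sketch of this saddle-point evaluation, using $T(z) = -W(-z)$ (Lambert $W$, via `scipy.special.lambertw`) to evaluate $\beta(z)$ and a simple finite-difference root search for $g'(z_0)=0$; the helper names and step sizes are illustrative assumptions, not part of the talk.

```python
import numpy as np
from scipy.special import lambertw
from scipy.optimize import brentq

def beta(z):
    """beta(z) = B(z/e) = 1/(1 - T(z/e)), with T(w) = -W(-w) (Lambert W)."""
    return 1.0 / (1.0 + lambertw(-z / np.e).real)

def g(z, n, m):
    return m * np.log(beta(z)) - (n + 1) * np.log(z)

def saddle_point(n, m):
    """Solve g'(z0) = 0 on (0,1) using a central-difference derivative."""
    h = 1e-7
    dg = lambda z: (g(z + h, n, m) - g(z - h, n, m)) / (2 * h)
    return brentq(dg, 1e-6, 1 - 1e-6)

# Example: m = n = 100; leading saddle-point estimate of log2 D_{n,m}.
n = m = 100
z0 = saddle_point(n, m)
h = 1e-4
g2 = (g(z0 + h, n, m) - 2 * g(z0, n, m) + g(z0 - h, n, m)) / h**2   # g''(z0)
logD = 0.5 * np.log(2 * np.pi * n) + g(z0, n, m) - 0.5 * np.log(2 * np.pi * g2)
print(z0, logD / np.log(2))   # redundancy in bits, roughly 1.2 bits/symbol for m = n
```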

Saddle Points

[Figure: location of the saddle point $z_0$ in the three regimes $m = o(n)$, $m = n$, and $n = o(m)$.]

Main Results: Large Alphabet

Theorem 1 (Orlitsky and Santhanam, 2004, and W.S. and Weinberger, 2010). For memoryless sources $\mathcal{M}_0$ over an $m$-ary alphabet, with $m\to\infty$ as $n$ grows, we have:

(i) For $m = o(n)$,
$$R_{n,m}^*(\mathcal{M}_0) = \frac{m-1}{2}\log\frac{n}{m} + \frac{m}{2}\log e + \frac{m\log e}{3}\sqrt{\frac{m}{n}} - \frac{1}{2} + O\!\left(\sqrt{\frac{m}{n}}\right).$$

(ii) For $m = \alpha n + \ell(n)$, where $\alpha$ is a positive constant and $\ell(n) = o(n)$,
$$R_{n,m}^*(\mathcal{M}_0) = n\log B_\alpha + \ell(n)\log C_\alpha - \log\sqrt{A_\alpha} + O\!\left(\ell(n)^2/n\right),$$
where $C_\alpha := \tfrac12 + \tfrac12\sqrt{1+4/\alpha}$, $A_\alpha := C_\alpha + 2/\alpha$, and $B_\alpha := \alpha\, C_\alpha^{\alpha+2}\, e^{-1/C_\alpha}$.

(iii) For $n = o(m)$,
$$R_{n,m}^*(\mathcal{M}_0) = n\log\frac{m}{n} + \frac{3n^2}{2m}\log e - \frac{3n}{2m}\log e + O\!\left(\frac{1}{n} + \frac{n^3}{m^2}\right).$$
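
The constants in part (ii) are easy to evaluate numerically; the sketch below (the function name `rate_constants` is just illustrative) shows that for $m = n$ the redundancy grows at roughly 1.2 bits per symbol.

```python
from math import exp, log2, sqrt

def rate_constants(alpha):
    """Constants C_alpha, A_alpha, B_alpha of Theorem 1(ii) for m = alpha*n."""
    C = 0.5 + 0.5 * sqrt(1 + 4 / alpha)
    A = C + 2 / alpha
    B = alpha * C ** (alpha + 2) * exp(-1 / C)
    return C, A, B

# For m = n (alpha = 1), C_1 is the golden ratio and the leading rate is
# n * log2(B_1), i.e. about 1.19 bits per symbol.
C1, A1, B1 = rate_constants(1.0)
print(C1, log2(B1))
```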

Constrained Memoryless Sources

Consider the following generalized source $\widetilde{\mathcal{M}}_0$: the alphabet is $\mathcal{A}\cup\mathcal{B}$, where $|\mathcal{A}| = m$ and $|\mathcal{B}| = M$; the probabilities $p_1,\ldots,p_m$ of $\mathcal{A}$ are unknown, while the probabilities $q_1,\ldots,q_M$ of $\mathcal{B}$ are fixed; define $q = q_1+\cdots+q_M$ and $p = 1-q$.

Our goal is to compute the minimax redundancy $R_{n,m,M}^*(\widetilde{\mathcal{M}}_0)$. Recall that $R_{n,m,M}^*(\widetilde{\mathcal{M}}_0) = \log D_{n,m,M} + O(1)$, where $D_{n,m,M} = 2^{d_{n,m,M}}$.

Lemma 1.
$$D_{n,m,M} = \sum_{k=0}^{n}\binom{n}{k} p^k (1-p)^{n-k} D_{k,m_k},$$
where $\log D_{k,m_k} = R_{k,m_k}^*(\mathcal{M}_0) + O(1)$ is the minimax redundancy over $\mathcal{A}$ as presented in Theorem 1.

Proof. It basically follows from $D_{n,m,M} = \sum_{x\in(\mathcal{A}\cup\mathcal{B})^n}\sup_P P(x)$, splitting each sequence according to which $k$ of its positions carry $\mathcal{A}$-symbols (whose probabilities are optimized) and summing the fixed $\mathcal{B}$-probabilities over the remaining positions.

Binomial Sums

Consider the following binomial sum:
$$S_f(n) = \sum_{k=0}^{n}\binom{n}{k} p^k (1-p)^{n-k} f(k).$$
In our case $f(k) = D_{k,m_k}$, so that $D_{n,m,M} = S_f(n)$.

Case $f(k) = O(n^a\log^b n) = o(e^n)$:
$$S_f(n) = f(np)\,(1 + O(1/n))$$
(cf. Jacquet & W.S., 1999, and Flajolet, 1999).

Case $f(k) = \alpha^k$:
$$S_f(n) = (p\alpha + 1 - p)^n.$$

Case $f(k) = O(k^{\beta k})$:
$$S_f(n) \sim p^n f(n).$$
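
A short Python sketch of the first two cases for generic $f$ (the helper name `binomial_sum` and the chosen test functions are illustrative):

```python
from math import comb

def binomial_sum(n, p, f):
    """S_f(n) = sum_k C(n,k) p^k (1-p)^(n-k) f(k)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * f(k) for k in range(n + 1))

n, p = 1000, 0.3

# Polynomially growing f concentrates around the mean: S_f(n) ~ f(np).
f_poly = lambda k: k**2 * (1 + k) ** 0.5
print(binomial_sum(n, p, f_poly) / f_poly(n * p))   # close to 1, i.e. 1 + O(1/n)

# Exponential f reproduces the binomial theorem exactly.
alpha = 1.01
print(binomial_sum(n, p, lambda k: alpha**k) / (p * alpha + 1 - p) ** n)  # ~1.0
```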

Main Results for the Constrained Model $\widetilde{\mathcal{M}}_0$

Theorem 2 (W.S. and Weinberger, 2010). Write $m_n$ for $m$ depending on $n$.

(i) $m_n = o(n)$. Assume: (a) $m(x) := m_x$ and its derivatives are continuous functions; (b) $\Delta_n := m_{n+1} - m_n = O(m'(n))$, $m'(n) = O(m/n)$, $m''(n) = O(m/n^2)$. If $m_n = o(\sqrt{n/\log n})$, then
$$R_{n,m,M}^* = \frac{m_{np}-1}{2}\log\frac{np}{m_{np}} + \frac{m_{np}}{2}\log e - \frac{1}{2} + O\!\left(\frac{1}{m_n} + \frac{m_n^2}{n}\log^2 n\right).$$
Otherwise,
$$R_{n,m,M}^* = \frac{m_{np}}{2}\log\frac{np}{m_{np}} + \frac{m_{np}}{2}\log e + O\!\left(\log n + \frac{m_n^2}{n}\log^2\frac{n}{m_n}\right).$$

(ii) Let $m_n = \alpha n + \ell(n)$, where $\alpha$ is a positive constant and $\ell(n) = o(n)$. Then
$$R_{n,m,M}^* = n\log\left(B_\alpha\, p + 1 - p\right) - \log\sqrt{A_\alpha} + O\!\left(\frac{\ell(n)+1}{n}\right),$$
where $A_\alpha$ and $B_\alpha$ are defined in Theorem 1.

(iii) Let $n = o(m_n)$ and assume $m_k/k$ is a nondecreasing sequence. Then
$$R_{n,m,M}^* = n\log\frac{p\, m_n}{n} + O\!\left(\frac{n^2}{m_n} + \frac{1}{n}\right).$$

Outline Update

1. Source Coding: The Redundancy Rate Problem
2. Universal Memoryless Sources
3. Universal Renewal Sources

Renewal Sources (Virtual Large Alphabet)

The renewal process $\mathcal{R}_0$ (introduced in 1996 by Csiszár and Shields) is defined as follows: let $T_1, T_2, \ldots$ be a sequence of i.i.d. positive-valued random variables with distribution $Q(j) = \Pr\{T_i = j\}$. In a binary renewal sequence the positions of the 1s are at the renewal epochs $T_0, T_0+T_1, \ldots$, with runs of zeros of lengths $T_1-1, T_2-1, \ldots$.

For a sequence
$$x_1^n = 1\,0^{\alpha_1}\,1\,0^{\alpha_2}\,\cdots\,1\,0^{\alpha_k}\,\underbrace{0\cdots 0}_{k^*}$$
define $k_m$ as the number of $i$ such that $\alpha_i = m$. Then
$$P(x_1^n) = [Q(0)]^{k_0}[Q(1)]^{k_1}\cdots[Q(n-1)]^{k_{n-1}}\,\Pr\{T_1 > k^*\}.$$

Theorem 3 (Flajolet and W.S., 1998). Consider the class of renewal processes. Then
$$R_n^*(\mathcal{R}_0) = \frac{2}{\log 2}\sqrt{cn} + O(\log n), \qquad c = \frac{\pi^2}{6} - 1 \approx 0.645.$$
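
For a rough numerical feel of Theorem 3's leading term (a small sketch; the helper name is illustrative):

```python
from math import log, pi, sqrt

# Leading term of Theorem 3: R*_n(R_0) ~ (2/log 2) * sqrt(c*n), c = pi^2/6 - 1.
c = pi**2 / 6 - 1   # ~ 0.645

def renewal_redundancy(n):
    return 2 / log(2) * sqrt(c * n)

print(c, renewal_redundancy(10**6))   # about 0.645 and about 2.3e3 bits for n = 10^6
```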

Maximal Minimax Redundancy for Renewal Sources

It can be proved that
$$r_{n+1} - 1 \;\le\; D_n(\mathcal{R}_0) \;\le\; \sum_{m=0}^{n} r_m,$$
where
$$r_n = \sum_{k=0}^{n} r_{n,k}, \qquad r_{n,k} = \sum_{\mathcal{I}(n,k)}\binom{k}{k_0,\ldots,k_{n-1}}\left(\frac{k_0}{k}\right)^{k_0}\left(\frac{k_1}{k}\right)^{k_1}\cdots\left(\frac{k_{n-1}}{k}\right)^{k_{n-1}}$$
and $\mathcal{I}(n,k)$ is the set of integer partitions of $n$ into $k$ terms, i.e.,
$$n = k_0 + 2k_1 + \cdots + nk_{n-1}, \qquad k = k_0 + \cdots + k_{n-1}.$$

But we shall study
$$s_n = \sum_{k=0}^{n} s_{n,k}, \qquad s_{n,k} = e^{-k}\sum_{\mathcal{I}(n,k)}\frac{k_0^{k_0}}{k_0!}\cdots\frac{k_{n-1}^{k_{n-1}}}{k_{n-1}!},$$
since
$$S(z,u) = \sum_{k,n} s_{n,k}\, u^k z^n = \prod_{i=1}^{\infty}\beta(z^i u).$$
Then
$$s_n = [z^n]S(z,1) \approx [z^n]\exp\!\left(\frac{c}{1-z} + a\log\frac{1}{1-z}\right).$$

Theorem 4 (Flajolet and W.S., 1998). We have the following asymptotics:
$$s_n \sim \exp\!\left(2\sqrt{cn} - \frac{7}{8}\log n + O(1)\right), \qquad \log r_n = \frac{2}{\log 2}\sqrt{cn} - \frac{5}{8}\log n + \frac{1}{2}\log\log n + O(1).$$

That s IT