On Universal Types. Gadiel Seroussi Hewlett-Packard Laboratories Palo Alto, California, USA. University of Minnesota, September 14, 2004
1 On Universal Types Gadiel Seroussi Hewlett-Packard Laboratories Palo Alto, California, USA University of Minnesota, September 14, 2004
2 Types for Parametric Probability Distributions
$A$ = finite alphabet; a sequence or string $x^n = x_1 x_2 \cdots x_n \in A^n$ ($n \ge 0$); $x_j^k = x_j x_{j+1} \cdots x_k$ denotes a substring
Parametric family $\mathcal{P} = \{P_\Theta\}$ of distributions on $A^n$, $P_\Theta : A^n \to [0,1]$, $\Theta \in D \subseteq \mathbb{R}^K$
Type class of $x^n$ with respect to $\mathcal{P}$: $T^{\mathcal{P}}_{x^n} = \{\, y^n \in A^n : P_\Theta(y^n) = P_\Theta(x^n) \;\; \forall\, \Theta \in D \,\}$
We say that $y^n$ and $x^n$ are of the same type
3 Types for Parametric Probability Distributions (cont.)
Type class of $x^n$ with respect to $\mathcal{P}$: $T^{\mathcal{P}}_{x^n} = \{\, y^n \in A^n : P_\Theta(y^n) = P_\Theta(x^n) \;\; \forall\, \Theta \in D \,\}$
Example: $A = \{0,1\}$, $\mathcal{P} = \{\mathrm{Bernoulli}(\theta) : 0 \le \theta \le 1\}$ (1-parameter family): $x^n$ and $y^n$ are of the same type $\iff$ they have the same count of 1s
Example: $k$-th order binary Markov: $2^k$ parameters $P(x_i = 1 \mid x_{i-k}^{i-1})$; same type $\iff$ same $(k{+}1)$-st order joint empirical distributions (with appropriate initial-state assumptions)
The method of types is classical in information theory [Csiszár & Körner 81, Csiszár 98, ...]
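The Bernoulli case is easy to check mechanically. A minimal Python sketch (the helper name same_bernoulli_type is ours, not from the talk):

```python
def same_bernoulli_type(x, y):
    """Same type under the Bernoulli(theta) family
    <=> same length and same count of 1s."""
    return len(x) == len(y) and x.count("1") == y.count("1")

print(same_bernoulli_type("10101100", "00011110"))  # True: both have four 1s
```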
4 The Lempel-Ziv incremental parsing
Parse $x^n$ into phrases: $x^n = p_0 p_1 p_2 \cdots p_{c-1} t_x$, where $p_0 = \lambda$, the null string
$p_i$, $i > 0$, is the shortest substring of $x^n$, starting at the position following $p_{i-1}$, such that $p_i \ne p_j$ for all $j < i$ (all $p_i$'s are different)
Every phrase $p_i$, except $p_0 = \lambda$, is an extension of another phrase
$t_x$ is the remaining tail of $x^n$, where parsing is truncated due to the end of the sequence (it must be equal to one of the $p_i$'s)
The LZ78 universal data compression algorithm [Ziv & Lempel, 1978] has code length $L_{\mathrm{LZ}} = c \log c + O(c)$
Example: $x^8 = 10101100 = \lambda, 1, 0, 10, 11, 00$; $c = 6$, $t_x = \lambda$
Tree representation: nodes $\leftrightarrow$ phrases [tree figure: root $\lambda$ with children 0 and 1; node 0 has child 00; node 1 has children 10 and 11]
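A minimal Python sketch of the incremental parsing rule just described (the function name lz78_parse is ours); it returns both the phrase list and the tail:

```python
def lz78_parse(x):
    """LZ78 incremental parsing: returns (phrases, tail).

    phrases[0] is the null string lambda; each later phrase is the
    shortest prefix of the unparsed remainder not seen before."""
    phrases = [""]          # p_0 = lambda
    seen = {""}
    i = 0
    while i < len(x):
        j = i + 1
        while j <= len(x) and x[i:j] in seen:
            j += 1
        if j > len(x):      # ran off the end: the remainder is the tail t_x
            return phrases, x[i:]
        seen.add(x[i:j])
        phrases.append(x[i:j])
        i = j
    return phrases, ""      # empty tail

phrases, tail = lz78_parse("10101100")
print(phrases, repr(tail))  # ['', '1', '0', '10', '11', '00'] ''
```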
5 Universal type classes
$T_{x^n} = \{p_0, p_1, \ldots, p_{c-1}\}$ = set of phrases of $x^n$
Universal (LZ) type class of $x^n$: $U_{x^n} = \{\, y^n \in A^n : T_{y^n} = T_{x^n} \,\}$
All strings of length $n$ that produce the same set of phrases as $x^n$: same phrases, generally in a different order; the phrase permutation must respect the prefix relation
Example: $x^8 = 10101100 = 1, 0, 10, 11, 00$ and $y^8 = 00011110 = 0, 00, 1, 11, 10$ have the same phrase set
6 Why are universal type classes interesting?
$N(u^k, v^m)$ = number of (possibly overlapping) occurrences of $u^k$ in $v^m$
$\hat{P}^{(k)}_{x^n}(u^k) = N(u^k, x^n)/(n-k+1)$, $u^k \in A^k$: empirical joint distribution of order $k \le n$
Theorem. Let $x^n$ be an arbitrary sequence of length $n$, and $k$ a fixed positive integer. If $y^n \in U_{x^n}$, then, for all $u^k \in A^k$, $\hat{P}^{(k)}_{x^n}(u^k) - \hat{P}^{(k)}_{y^n}(u^k) \to 0$ as $n \to \infty$
(empirical distributions of any order $k$ converge in the variational sense)
Asymptotics: an infinite sequence of sequences $x^n$, not necessarily prefix-related
Proof arguments: any occurrence of $u^k$ is either contained in a phrase, spans a phrase boundary ($O(kc)$ places), or is contained in the tail ($\le |t_x|$ places); well-known LZ properties: $c \le n/(\log n - o(\log n))$, $|p_i| \le \sqrt{2n} + o(\sqrt{n})$
Convergence rate: $O(1/\log n)$
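The empirical distributions in the theorem are straightforward to compute. An illustrative Python sketch (helper names are ours) that evaluates the variational distance between the order-$k$ empirical distributions of the two same-type sequences from the earlier example:

```python
from collections import Counter

def empirical(x, k):
    """k-th order empirical distribution: relative frequencies of the
    (overlapping) k-grams of x, normalized by n - k + 1."""
    n = len(x)
    counts = Counter(x[i:i + k] for i in range(n - k + 1))
    return {u: c / (n - k + 1) for u, c in counts.items()}

def variational_distance(p, q):
    """Sum of |p(u) - q(u)| over all k-grams u."""
    keys = set(p) | set(q)
    return sum(abs(p.get(u, 0.0) - q.get(u, 0.0)) for u in keys)

# x and y are in the same universal type (same LZ78 phrase set).
px = empirical("10101100", 2)
py = empirical("00011110", 2)
# The theorem says this -> 0 as n grows; at n = 8 the gap is still large.
print(variational_distance(px, py))
```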
7 Finite memory probability assignments
A $k$-th order (finite-memory) probability assignment on $A^n$ is defined by a set of conditional distributions $Q_k(u_{k+1} \mid u^k)$, $u^{k+1} \in A^{k+1}$, and a distribution $Q_k(u^k)$ on the initial state, so that $Q_k(x^n) = Q_k(x^k) \prod_{i=k+1}^{n} Q_k(x_i \mid x_{i-k}^{i-1})$
In particular, $Q_k$ could be defined by the $k$-th order approximation of an ergodic measure
Corollary. Let $x^n$ and $y^n$ be sequences of the same universal type. Then, for any nonnegative integer $k$ and any $k$-th order probability assignment $Q_k$ such that $Q_k(x^n), Q_k(y^n) \ne 0$, we have $\frac{1}{n} \log \frac{Q_k(x^n)}{Q_k(y^n)} \to 0$ as $n \to \infty$
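As an illustration, a $k$-th order assignment of this form can be evaluated in a few lines of Python; the function and its dictionary-based interface are our own sketch, not from the talk:

```python
import math

def log_prob_markov(x, k, q_init, q_cond):
    """log2 Q_k(x^n) for a k-th order finite-memory assignment.

    q_init maps the initial k-gram x^k to a probability;
    q_cond maps (context of length k, next symbol) to a probability."""
    lp = math.log2(q_init[x[:k]])
    for i in range(k, len(x)):
        lp += math.log2(q_cond[(x[i - k:i], x[i])])
    return lp

# Degenerate k = 0 case: Bernoulli(1/2) assigns 2^{-n} to every x^n.
q_init = {"": 1.0}
q_cond = {("", "0"): 0.5, ("", "1"): 0.5}
print(log_prob_markov("10101100", 0, q_init, q_cond))  # -8.0
```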
8 Conventional vs. Universal types
Sequences of the same conventional type $T^{\mathcal{P}}_{x^n}$: have identical empirical distributions of a given structure; are assigned identical probabilities by assignments from a given parametric class
Sequences of the same universal type $U_{x^n}$: have converging empirical distributions of any finite order; are assigned probabilities whose normalized logs converge, for finite-memory assignments of any finite order
Trade-off: degree of indistinguishability vs. universality
9 Other things we like to do with type classes
Ask how many sequences are in the type class of $x^n$ (conventional: $2^{n\hat{H}(x^n)+o(n)}$)
Ask how many types there are for a given sequence length $n$ (conventional: $\sim n^K$)
Draw a sequence at random with uniform probability from the class
Enumerate the type class
Use the class for (enumerative) coding and universal simulation
10 The tree is the type (almost)
Assumptions: for simplicity, $A = \{0,1\}$ (everything carries over to $|A| > 2$); we ignore the effects of tails (a nuisance that does not affect the main results); when in doubt, $\log$ is to base 2
Parsing tree $T$: $c$ = number of nodes; $n = \sum_i |p_i|$ = path length of $T$; $l$ = number of leaves; $h$ = height = length of the longest path
[Tree figure: the running example, with $c = 6$, $n = 8$, $l = 3$, $h = 2$]
All sequences in $U_{x^n}$ yield the same tree: the type is $T$ (or $(T, n)$), a combinatorial characterization, not explicitly based on counts
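These tree parameters can be read off the phrase set directly. A small Python sketch (our own helper name) recovers $(c, n, l, h)$ for the running example:

```python
def tree_stats(phrases):
    """c = nodes, n = path length, l = leaves, h = height of the
    parsing tree whose nodes are the phrases (binary-string labels)."""
    nodes = set(phrases)
    c = len(nodes)
    n = sum(len(p) for p in nodes)
    internal = {p[:-1] for p in nodes if p}   # parents of non-root nodes
    l = len(nodes - internal)                 # nodes that are nobody's parent
    h = max(len(p) for p in nodes)
    return c, n, l, h

print(tree_stats(["", "1", "0", "10", "11", "00"]))  # (6, 8, 3, 2)
```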
11 Size of the universal type class
Phrases coming from the sub-trees $T_0$ and $T_1$ have no prefix-relation constraints between them, and can be interleaved arbitrarily
Let $U_T$ = the type class associated with a tree $T$. Then $|U_T| = |U_{T_0}|\,|U_{T_1}| \binom{c-1}{c_0}$
Note: $c - 1 = c_0 + c_1$; cases where a sub-tree is missing are easily accounted for (deficient trees)
Unroll the recursion and telescope. Let $c_{p_i}$ = size of the sub-tree rooted at $p_i$. Then $|U_T| = \dfrac{(c-1)!}{\prod_{i>0} c_{p_i}}$
12 Size of the universal type class: example
$x^8 = 10101100$, $c = 6$; $c_0 = 2$, $c_{00} = 1$, $c_1 = 3$, $c_{10} = c_{11} = 1$; $|U_T| = \dfrac{5!}{2 \cdot 3} = 20$
0, 00, 1, 10, 11 | 0, 00, 1, 11, 10 | 0, 1, 00, 10, 11 | 0, 1, 00, 11, 10 | 0, 1, 10, 00, 11 | 0, 1, 11, 00, 10 | 0, 1, 10, 11, 00 | 0, 1, 11, 10, 00 | 1, 0, 00, 10, 11 | 1, 0, 00, 11, 10 | 1, 0, 10, 00, 11 | 1, 0, 11, 00, 10 | 1, 0, 10, 11, 00 | 1, 0, 11, 10, 00 | 1, 10, 0, 00, 11 | 1, 11, 0, 00, 10 | 1, 10, 0, 11, 00 | 1, 11, 0, 10, 00 | 1, 10, 11, 0, 00 | 1, 11, 10, 0, 00
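The closed-form count is easy to evaluate programmatically. This Python sketch (our own representation: a children map keyed by phrase) computes $|U_T|$ from the sub-tree sizes and reproduces the value 20 for the example tree:

```python
from math import factorial

def type_class_size(children, root=""):
    """|U_T| = (c-1)! / prod over non-root nodes of sub-tree sizes.
    `children` maps each node to the list of its children."""
    sizes = {}
    def subtree(v):
        sizes[v] = 1 + sum(subtree(w) for w in children.get(v, []))
        return sizes[v]
    c = subtree(root)
    denom = 1
    for v, s in sizes.items():
        if v != root:
            denom *= s
    return factorial(c - 1) // denom

# Tree for the running example x^8 = 10101100:
T = {"": ["0", "1"], "0": ["00"], "1": ["10", "11"]}
print(type_class_size(T))  # 20
```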
13 Asymptotic size of a universal type class
Theorem. The size of $U_{x^n}$ satisfies $\frac{1}{n}\left(\log |U_{x^n}| - L_{\mathrm{LZ}}(x^n)\right) \to 0$ as $n \to \infty$
Analogous to conventional types, where $\frac{1}{n}\log |T^{\mathcal{P}}_{x^n}| - \hat{H}(x^n) \to 0$ as $n \to \infty$, with $L_{\mathrm{LZ}}(x^n) \approx c \log c$ in lieu of the empirical entropy $\hat{H}(x^n)$ (entropy of the ML distribution)
14 Asymptotic size of a universal type class (zoom-in)
Theorem. The size of $U_T$ satisfies $(1 - \beta - o(1))\, c \log c \;\le\; \log |U_T| \;\le\; (1 - \eta - o(1))\, c \log c$, where $0 \le \beta, \eta \le 1$, and $\beta$ is bounded away from 0 iff $\eta$ is bounded away from 0 iff $\frac{\log n}{\log c}$ is bounded away from 1
Notes: we have $1 \le \frac{\log n}{\log c} \le 2$ (up to lower-order terms); $\frac{\log n}{\log c}$ bounded away from 1 $\iff$ $c \log c = o(n)$ $\iff$ $\frac{1}{n} L_{\mathrm{LZ}} \to 0$
More detail: $\beta = \left(1 - \frac{l}{c}\right)\left(\frac{\log n}{\log c} - 1\right)$, $\eta = \frac{(h+1)\log(h+1)}{c \log c}$
15 Asymptotic size of the universal type: proof outline
$|U_T| = \frac{(c-1)!}{\prod_{i>0} c_{p_i}} = \frac{(c-1)!}{D}$; we need to estimate $D = \prod_{i>0} c_{p_i}$ (hard?)
Let's estimate the sum instead: $S = \sum_{i>0} c_{p_i} = n$ — easy!
If $p_i$ is a leaf, then $c_{p_i} = 1$. Rewrite the sum and the product over non-leaf nodes: $\sum_{c_{p_i}>1} c_{p_i} = n - l$; $D = \prod_{c_{p_i}>1} c_{p_i}$
$D$ is a product of positive integers whose sum is known, so $D \le \left(\frac{n-l}{c-l-1}\right)^{c-l-1}$; the rest follows
16 Examples
Skinny full tree: $c = 2k+1$, $n = k(k+1) = \frac{c^2-1}{4}$, $\frac{\log c}{\log n} \to \frac{1}{2}$, $\log |U_T| = \log\left(2^k\, k!\right) = \frac{1}{2}\, c \log c + O(c)$
Fat (complete) tree: $c = 2^{m+1} - 1$, $n = \sum_{i=1}^{m} i\, 2^i = (m-1)2^{m+1} + 2$, $\frac{\log c}{\log n} \to 1$, $L_{\mathrm{LZ}} = c \log c + O(c)$, $\log |U_T| = c \log c + O(c)$
Values $0 \le 1 - \beta \le \frac{1}{2}$ are attained with deficient trees; e.g., for the totally skinny tree (a single path), $\log |U_T| = 0$
17 Coding with the universal type
Enumerative coding [Cover 1973]: encode $x^n$ by describing its type + the index of the sequence within the type
For universal types: description of the parsing tree takes $2c + o(c)$ bits; the index within the type takes $\lceil \log |U_{x^n}| \rceil$ bits
In some cases the total description length can be shorter than the LZ78 code length, e.g., for the skinny full tree, description length $= \frac{1}{2}\, c \log c + O(c)$
This is possible only when $\frac{1}{n} L_{\mathrm{LZ}} \to 0$
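For a rough feel of the comparison, the enumerative description length can be tallied as tree bits plus index bits; a toy sketch of our own, ignoring the $o(c)$ term in the tree description:

```python
import math

def enumerative_length_bits(c, type_class_size):
    """~2c bits for the parsing tree + ceil(log2 |U_T|) bits for the index."""
    return 2 * c + math.ceil(math.log2(type_class_size))

# Running example: c = 6, |U_T| = 20  ->  12 + 5 = 17 bits
print(enumerative_length_bits(6, 20))
```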
18 How many universal types?
Number of types for sequences of length $n$ = number of LZ78 parsing dictionaries = number of binary trees of given path length $n$
A known (hard?) open combinatorial problem
$b_{c,n}$ = number of trees with $c$ nodes and path length $n$; the generating function $B(w, z) = \sum_{c,n} b_{c,n} w^n z^c$ satisfies $z B(w, wz)^2 = B(w, z) - 1$ (we need $B(w, 1)$)
Airy phenomenon, Brownian motion, analysis of quicksort, ...
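The decomposition behind the generating-function identity (a tree is a root plus two sub-trees, each of whose nodes gains one level of depth) gives a direct dynamic program for $b_{c,n}$; the recursion, function name, and demo below are our own sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def b(c, n):
    """Number of binary trees with c nodes and path length n, from the
    decomposition underlying z*B(w,wz)^2 = B(w,z) - 1."""
    if c == 0:
        return 1 if n == 0 else 0       # the empty tree
    total = 0
    for cl in range(c):                 # left sub-tree size; right gets c-1-cl
        cr = c - 1 - cl
        # attaching below the root adds c-1 to the combined path length
        for nl in range(n - (c - 1) + 1):
            total += b(cl, nl) * b(cr, n - (c - 1) - nl)
    return total

# Number of types for n = 8: trees of path length 8, any node count.
print(sum(b(c, 8) for c in range(1, 9)))
```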
19 How many universal types?
Using information-theoretic/combinatorial arguments:
Theorem. Let $\tau_n = \log(\text{number of types for sequences of length } n)$. Then $\tau_n = \frac{2n}{\log n}(1 + o(1))$
Sloane's OEIS #A… : number of binary trees for a given path length
G. Seroussi, On the number of t-ary trees with a given path length
20 How many universal types? (general alphabet)
Theorem. Let $t = |A|$ and $\tau_n = \log_t(\text{number of types for sequences of length } n)$. Then $\tau_n = t\, H_2(t^{-1})\, \frac{n}{\log n}(1 + o(1))$
21 Bounds
Upper bound: $\tau_n \le \frac{2n}{\log n}(1 + o(1))$
A binary tree with $c$ nodes can be encoded using $2c$ bits; a tree with path length $n$ has at most $c = \frac{n}{\log n - O(\log\log n)}$ nodes, so $\frac{2n}{\log n - O(\log\log n)}$ bits are sufficient to (losslessly) encode all trees of path length $n$
Lower bound: $\tau_n \ge \frac{2n}{\log n}(1 - o(1))$
Proof by building a large class of trees of the same path length: take $2^{m-1}$ nodes at depth $m$ and attach $l - 1$ different sub-trees of size $\approx m$ at these nodes; the $(l-1)!$ permutations of the sub-trees yield $(l-1)!$ different trees of the same path length
22 Drawing a random sequence from the universal type
Mark phrases as unused or used (initially all unused except the root $\lambda$)
Pick the next phrase by starting from the root and following branches to an unused phrase
When at a bifurcation, choose a branch randomly with probabilities proportional to the number of unused nodes in each sub-tree
Example [tree figure]: at the root, choose a branch with $\mathrm{Prob}(0) = \frac{2}{5}$, $\mathrm{Prob}(1) = \frac{3}{5}$; choose 1; $P = \frac{3}{5}$; $y = 1$
23 (cont.) At the root: choose a branch with $\mathrm{Prob}(0) = \frac{1}{2}$, $\mathrm{Prob}(1) = \frac{1}{2}$; choose 1; $P = \frac{3}{5} \cdot \frac{1}{2}$; reached a used node, keep going
24 (cont.) At node 1: choose a branch with $\mathrm{Prob}(0) = \frac{1}{2}$, $\mathrm{Prob}(1) = \frac{1}{2}$; choose 0; $P = \frac{3}{5} \cdot \frac{1}{2} \cdot \frac{1}{2}$; $y = 1\,10$
25 (cont.) At the root: choose a branch with $\mathrm{Prob}(0) = \frac{2}{3}$, $\mathrm{Prob}(1) = \frac{1}{3}$; choose 0; $P = \frac{3}{5} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{2}{3}$; $y = 1\,10\,0$
26 (cont.) At the root: choose a branch with $\mathrm{Prob}(0) = \frac{1}{2}$, $\mathrm{Prob}(1) = \frac{1}{2}$; choose 0 (the rest is deterministic); $P = \frac{3}{5} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{20} = \frac{1}{|U_T|}$; $y = 1\,10\,0\,00\,11$
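Putting the procedure together, here is a Python sketch (our own data representation: a children map keyed by phrase) that performs the walk, tracks the accumulated probability exactly, and returns a uniformly drawn phrase ordering; on the example tree the returned probability is always $\frac{1}{20} = \frac{1}{|U_T|}$, matching the slides:

```python
import random
from fractions import Fraction

def draw_from_type(children, root="", rng=random):
    """Draw a parse uniformly from U_T: repeatedly walk from the root,
    branching with probability proportional to the number of unused
    nodes in each sub-tree, until an unused node is reached."""
    unused = {}                       # unused[v] = unused nodes under v (incl. v)
    def count(v):
        unused[v] = 1 + sum(count(w) for w in children.get(v, []))
        return unused[v]
    count(root)
    used = {root}                     # p_0 = lambda starts out used
    unused[root] -= 1
    phrases, prob = [], Fraction(1)
    while unused[root] > 0:
        path, v = [root], root
        while True:
            kids = [w for w in children.get(v, []) if unused[w] > 0]
            w = rng.choices(kids, weights=[unused[k] for k in kids])[0]
            if len(kids) > 1:         # only true bifurcations consume randomness
                prob *= Fraction(unused[w], sum(unused[k] for k in kids))
            path.append(w)
            if w not in used:         # found the next phrase
                used.add(w)
                phrases.append(w)
                break
            v = w
        for u in path:                # one fewer unused node below each
            unused[u] -= 1
    return phrases, prob              # prob == 1 / |U_T|

T = {"": ["0", "1"], "0": ["00"], "1": ["10", "11"]}
print(draw_from_type(T))  # e.g. (['1', '10', '0', '00', '11'], Fraction(1, 20))
```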
27 Drawing a random sequence from the universal type
Proposition. The procedure outlined (i) produces a sequence $y^n \in U_{x^n}$ with uniform probability, and (ii) requires an average of $\log |U_{x^n}| + O(1)$ random bits to do so (i.e., it is entropy-efficient)
(ii) is accomplished by letting an arithmetic decoder, fed with a Bernoulli($\frac{1}{2}$) sequence and the sequence of bifurcation probabilities, generate the random choices called for by the procedure [Han-Hoshi 97]
28 Universal simulation of individual sequences
Given an individual sequence $x^n$, we wish: 1. to produce a sequence $y^n$ that is statistically similar to $x^n$; 2. to let $y^n$ have as much uncertainty (entropy) as possible given the previous requirement
Proposed solution: choose $y^n$ randomly and uniformly from $U_{x^n}$
1. Similarity = convergence in variation of empirical distributions of any order
2. Asymptotically optimal in the following sense: any simulation scheme satisfying (1) must have $H(y^n) \le \log |U_{x^n}| + n\epsilon$ for any $\epsilon > 0$ and almost all $x^n$ (a doubly-universal analogue of [Merhav-Weinberger 04])
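An end-to-end simulation then chains the earlier sketches: parse $x^n$, rebuild the parsing tree, and draw $y^n$ uniformly from the type (this reuses lz78_parse and draw_from_type as defined above; the tail is ignored, per the standing assumption):

```python
# End-to-end sketch reusing lz78_parse and draw_from_type from above.
x = "10101100"
phrases, _ = lz78_parse(x)
children = {}
for p in phrases:
    if p:                                   # skip the root lambda
        children.setdefault(p[:-1], []).append(p)
order, prob = draw_from_type(children)
y = "".join(order)                          # statistically similar to x
print(y, prob)                              # e.g. "11000011" Fraction(1, 20)
```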
29 Example: simulation of a binary texture [figure: binary texture and its simulation, scanned with a Peano scan; image dimensions not recovered]
30 Example: simulation of a binary texture (cont.) [figure: binary texture and its simulation, scanned with a Peano scan; image dimensions not recovered]
31 Additional results and research questions
Explicit enumeration of the universal type: given $x^n$, efficiently implement the map $\{1, 2, 3, \ldots, |U_{x^n}|\} \to U_{x^n}$
This provides an alternative way to draw uniformly from the type class; trade-off: a fixed number of random bits with approximately uniform probability vs. a variable number of random bits with exactly uniform probability; it also allows implementation of the enumerative code (a sketch follows below)
Ongoing research: study universal types for other universal compression algorithms (e.g., Rissanen's Context); no problems in principle, but the combinatorial analysis seems more challenging than for LZ
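One direct (non-optimized) way to realize the enumeration map is to unrank the phrase orderings, using the sub-tree-size formula to count completions at each step; a Python sketch under our own naming, with nodes labeled by their binary strings so that the parent of v is v[:-1]:

```python
from math import factorial

def num_extensions(nodes):
    """Number of valid phrase orderings of a remaining forest of nodes
    (binary-string labels): m! / prod of remaining-sub-tree sizes."""
    size = {v: 1 for v in nodes}
    for v in sorted(nodes, key=len, reverse=True):  # children before parents
        if v[:-1] in size:
            size[v[:-1]] += size[v]
    denom = 1
    for s in size.values():
        denom *= s
    return factorial(len(nodes)) // denom

def unrank(nodes, r):
    """Map r in {0, ..., |U_T|-1} to the r-th phrase ordering, trying
    candidates in lexicographic order at each step."""
    nodes, order = set(nodes), []
    while nodes:
        # available phrases: unused nodes whose parent is already used
        for a in sorted(v for v in nodes if v[:-1] not in nodes):
            n_a = num_extensions(nodes - {a})   # completions starting with a
            if r < n_a:
                order.append(a)
                nodes.remove(a)
                break
            r -= n_a
    return order

phrases = {"0", "00", "1", "10", "11"}   # running example, root omitted
print(unrank(phrases, 0), unrank(phrases, 19))
# ['0', '00', '1', '10', '11'] and ['1', '11', '10', '0', '00']
```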
More information