Nearly Maximally Predictive Features and Their Dimensions


Nearly Maximally Predictive Features and Their Dimensions

Sarah E. Marzen
James P. Crutchfield

SFI WORKING PAPER

SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant.

NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder.

SANTA FE INSTITUTE

Santa Fe Institute Working Paper 17-02-XXX
arxiv.org:1702.xxxx [physics.gen-ph]

Nearly Maximally Predictive Features and Their Dimensions

Sarah E. Marzen^{1,2,*} and James P. Crutchfield^{3,†}
^1 Physics of Living Systems Group, Department of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139
^2 Department of Physics, University of California at Berkeley, Berkeley, CA
^3 Complexity Sciences Center, Department of Physics, University of California at Davis, One Shields Avenue, Davis, CA
(Dated: February 7, 2017)
* semarzen@mit.edu
† chaos@ucdavis.edu

Scientific explanation often requires inferring maximally predictive features from a given data set. Unfortunately, the collection of minimal maximally predictive features for most stochastic processes is uncountably infinite. In such cases, one compromises and instead seeks nearly maximally predictive features. Here, we derive upper bounds on the rates at which the number and the coding cost of nearly maximally predictive features scale with desired predictive power. The rates are determined by the fractal dimensions of a process's mixed-state distribution. These results, in turn, show how widely used finite-order Markov models can fail as predictors and that mixed-state predictive features offer a substantial improvement.

PACS numbers: 02.50.-r 89.70.+c 05.45.Tp 02.50.Ga
Keywords: hidden Markov models, entropy rate, dimension, resource-prediction trade-off, mixed-state simplex

Often, we wish to find a minimal maximally predictive model consistent with available data. Perhaps we are designing interactive agents that reap greater rewards by developing a predictive model of their environment [1-6] or, perhaps, we wish to build a predictive model of experimental data because we believe that the resultant model gives insight into the underlying mechanisms of the system [7, 8]. Either way, we are almost always faced with constraints that force us to efficiently compress our data [9].

Ideally, we would compress information about the past without sacrificing any predictive power. For stochastic processes generated by finite unifilar hidden Markov models (HMMs), one need only store a finite number of predictive features. The minimal such features are called causal states, their coding cost is the statistical complexity $C_\mu$ [10], and the implied unifilar HMM is the $\epsilon$-machine [10, 11]. However, most processes require an infinite number of causal states [7] and so cannot be described by finite unifilar HMMs. In these cases, we can only attain some maximal level of predictive power given constraints on the number of predictive features or their coding cost. Equivalently, from finite data we can only infer a finite predictive model. Thus, we need to know how our predictive power grows with available resources.

Recent work elucidated the tradeoffs between resource constraints and predictive power for stochastic processes generated by countable unifilar HMMs or, equivalently, described by a finite or countably infinite number of causal states [12-14]. Few, though, studied this tradeoff or provided bounds thereof more generally. Here, we place new bounds on resource-prediction tradeoffs in the limit of nearly maximal predictive power for processes with either a countable or an uncountable infinity of causal states by coarse-graining the mixed-state simplex [15]. These bounds give a novel operational interpretation to the fractal dimension of the mixed-state simplex and suggest new routes towards quantifying the memory stored in a stochastic process when, as is typical, the statistical complexity diverges.
Background. We consider a discrete-time, discrete-state stochastic process $\mathcal{P}$ generated by an HMM $\mathcal{G}$, which comes equipped with underlying states $g$ and labeled transition matrices $T^{(x)}_{g,g'} = \Pr(\mathcal{G}_{t+1} = g', X_{t+1} = x \mid \mathcal{G}_t = g)$ [16]. There is an infinite number of alternate HMMs that generate $\mathcal{P}$ [17-20], so we specify here that $\mathcal{G}$ is the minimal generative model, that is, the generative model with the minimal number of hidden states consistent with the observed process [21].

For reasons that become clear shortly, we are interested in the block entropy $H(L) = H[X_{0:L}]$, where $X_{a:b} = X_a, X_{a+1}, \ldots, X_{b-1}$ is a contiguous block of random variables generated by $\mathcal{G}$. In particular, its growth rate, the entropy rate $h_\mu = \lim_{L\to\infty} H(L)/L$, quantifies a process's intrinsic randomness, while the excess entropy $\mathbf{E} = \lim_{L\to\infty} (H(L) - h_\mu L)$ quantifies how much is predictable: how much future information can be predicted from the past [22]. Finite-length entropy-rate estimates $h_\mu(L) = H[X_0 \mid X_{-L:0}]$ provide increasingly better approximations to the true entropy rate $h_\mu$ as $L$ grows large.
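To make these quantities concrete, the following minimal Python sketch (ours, not part of the paper) forms plug-in estimates of $H(L)$, $h_\mu(L)$, and $E(L)$ from a single long sample path; the biased-coin example and the choice of estimator are illustrative assumptions only.

```python
import numpy as np
from collections import Counter

def block_entropy(seq, L):
    """Plug-in estimate of the block entropy H(L) = H[X_{0:L}] in nats."""
    if L == 0:
        return 0.0
    counts = Counter(tuple(seq[i:i + L]) for i in range(len(seq) - L + 1))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-np.sum(p * np.log(p)))

def entropy_rate_estimates(seq, L_max):
    """h_mu(L) = H(L+1) - H(L) and E(L) = H(L) - h_mu * L, using h_mu(L_max) as
    a crude stand-in for the true entropy rate h_mu."""
    H = np.array([block_entropy(seq, L) for L in range(L_max + 2)])
    h = np.diff(H)                                    # h_mu(L) for L = 0, ..., L_max
    E = H[:L_max + 1] - h[-1] * np.arange(L_max + 1)  # finite-L excess-entropy estimates
    return H[:L_max + 1], h, E

# Example: a biased coin has h_mu = H_b(0.3) ~ 0.611 nats and E = 0.
rng = np.random.default_rng(0)
seq = rng.choice([0, 1], size=200_000, p=[0.7, 0.3])
H, h, E = entropy_rate_estimates(seq, L_max=8)
print(h[-1], E[-1])
```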

Similarly, the finite-length excess-entropy estimates

$E(L) = H(L) - h_\mu L$   (1)
$\;\;\;\;\;\;\;\;\;\; = \sum_{l=0}^{L-1} \left( h_\mu(l) - h_\mu \right)$   (2)

tend to the true excess entropy $\mathbf{E}$ as $L$ grows large [23]. As there, we consider only finitary processes, those with finite $\mathbf{E}$ [24].

Predictive features $R$, drawn from some alphabet $\mathcal{F}$, are formed by compressing the process's past $X_{-\infty:0}$ in ways that implicitly retain information about the future $X_{0:\infty}$. (From here on, our block notation suppresses infinite indices.) A predictive distortion quantifies the predictability lost after such a coarse-graining: $d(R) = I[X_{-\infty:0}; X_{0:\infty} \mid R] = \mathbf{E} - I[R; X_{0:\infty}]$. On the one hand, distortion achieves its maximum value $\mathbf{E}$ when $R$ captures no information from the past that could be used for prediction. On the other, it can be made to vanish trivially by taking the predictive features $R$ to be all of the possible histories $X_{-\infty:0}$. Here, we quantify the cost of a larger feature space in two ways: by the number $|\mathcal{F}|$ of predictive features and by their coding cost $H[R]$ [25].

Results. Ideally, we would identify the minimal number $|\mathcal{F}|$ of predictive features or the minimal coding cost $H[R]$ required to achieve a given level $d$ of predictive distortion. This is almost always a difficult optimization problem. However, we can place upper bounds on $|\mathcal{F}|$ and $H[R]$ by constructing suboptimal predictive feature sets that achieve predictive distortion $d$.

We start by reminding ourselves of the optimal solution in the limit that $d = 0$: the causal states $\mathcal{S}$ [10, 11, 13]. Causal states can be defined by calling two pasts $x_{-\infty:0}$ and $x'_{-\infty:0}$ equivalent when the predictions of the future made using them are the same: $\Pr(X_{0:\infty} \mid X_{-\infty:0} = x_{-\infty:0}) = \Pr(X_{0:\infty} \mid X_{-\infty:0} = x'_{-\infty:0})$. (These conditional distributions are referred to as future morphs.) One can then form a model, the $\epsilon$-machine, from the set $\mathcal{S}$ of causal-state equivalence classes and their transition operators. The Shannon entropy of the causal-state distribution is the statistical complexity: $C_\mu = H[\mathcal{S}]$. A process's $\epsilon$-machine can also be viewed as the minimal unifilar HMM capable of generating the process [26], and $C_\mu$ as the amount of historical information the process stores in its causal states. Though the mechanics of working with causal states can become rather sophisticated, the essential idea is that causal states are designed to capture everything about the past relevant to predicting the future and only that information.

In the lossless limit, when $d = 0$, one cannot find predictive representations that achieve $|\mathcal{F}|$ smaller than $|\mathcal{S}|$ or that achieve $H[R]$ smaller than $C_\mu$. Similarly, no optimal lossy predictive representation will ever have $|\mathcal{F}| > |\mathcal{S}|$ or $H[R] > C_\mu$. When the number of causal states is infinite or the statistical complexity diverges, as is typical, these bounds are of little use; but otherwise, they provide a useful calibration for the feature sets proposed below.

Markov Features. Several familiar predictive models use pasts of length $L$ as predictive features. This feature set can be thought of as constructing an order-$L$ Markov model of a process [27]. The implied predictive distortion is $d(L) = I[X_{-\infty:0}; X_{0:\infty} \mid X_{-L:0}] = \mathbf{E} - E(L)$ [28], while the number of features is the number of length-$L$ words with nonzero probability and their entropy is $H[R] = H(L)$.

Generally, $h_\mu(L)$ converges exponentially quickly to the true entropy rate $h_\mu$ for stochastic processes generated by finite-state HMMs, unifilar or not; see Ref. [29] and references therein. Then, $h_\mu(L) - h_\mu \approx K e^{-\lambda L}$ in the large-$L$ limit. From Eq. (2), we see that the convergence rate $\lambda$ also implies an exponential rate of decay, $\mathbf{E} - E(L) \approx K' e^{-\lambda L}$.
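Continuing the sketch above, one can tabulate the order-$L$ Markov feature set directly: the number of observed length-$L$ words (a stand-in for $|\mathcal{F}|$), the coding cost $H[R] = H(L)$, and the distortion $d(L) = \mathbf{E} - E(L)$, with $\mathbf{E}$ crudely approximated by $E(L_{\max})$. Again, these estimator choices are ours, not the paper's.

```python
def markov_feature_tradeoff(seq, L_max):
    """Order-L Markov features (length-L pasts): for each L, return the feature
    count |F|, the coding cost H[R] = H(L), and the distortion d(L) = E - E(L),
    with E approximated by E(L_max).  Reuses the helpers from the sketch above."""
    H, h, E = entropy_rate_estimates(seq, L_max)
    E_inf = E[-1]
    rows = []
    for L in range(1, L_max + 1):
        words = Counter(tuple(seq[i:i + L]) for i in range(len(seq) - L + 1))
        rows.append((L, len(words), H[L], max(E_inf - E[L], 0.0)))
    return rows

for L, n_feat, cost, dist in markov_feature_tradeoff(seq, 8):
    print(f"L={L}: |F|={n_feat}  H[R]={cost:.3f} nats  d(L)={dist:.4f} nats")
```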
Additionally, according to the asymptotic equipartition property [25], when $L$ is large, $|\mathcal{F}| \approx e^{h_0 L}$ and $H(L) \approx h_\mu L$, where $h_0$ is the topological entropy and $h_\mu$ the entropy rate, both in nats. In sum, this first set of predictive features (effectively, the construction of order-$L$ Markov models) yields an algebraic tradeoff between the size of the feature set $|\mathcal{F}|$ and the predictive distortion $d$:

$|\mathcal{F}| \sim \left(\frac{1}{d}\right)^{h_0/\lambda}$,   (3)

and a logarithmic tradeoff between the entropy of the features and the distortion:

$H[R] \sim \frac{h_\mu}{\lambda} \log\frac{1}{d}$.   (4)

In principle, $\lambda$ can be arbitrarily small, and so $h_0/\lambda$ and $h_\mu/\lambda$ arbitrarily large, even for processes generated by finite-state HMMs. This can be true even for finite unifilar HMMs ($\epsilon$-machines). To see this, let $W$ be the transition matrix of a process's mixed-state presentation, defined shortly. From Ref. [28], when $W$'s spectral gap $\gamma$ is small, we have $\mathbf{E} - E(L) \sim (1-\gamma)^L$, so that $\lambda = \log\frac{1}{1-\gamma}$ can be quite small. In short, when the process in question has only a finite number of causal states, $|\mathcal{F}|$ optimally saturates at $|\mathcal{S}|$ and $H[R]$ optimally saturates at $C_\mu$; but from Eqs. (3) and (4), the size of this Markov feature set can grow without bound when we attempt to achieve zero predictive distortion.

Mixed-State Features. A different predictive feature set comes from coarse-graining the mixed-state simplex. As first described by Blackwell [15], mixed states $Y$ are probability distributions $\Pr(\mathcal{G}_0 \mid X_{-L:0} = x_{-L:0})$ over the internal states $\mathcal{G}$ of a generative model, given the generated sequences. Transient mixed states are those at finite $L$, while recurrent mixed states are those remaining with positive probability in the limit that $L \to \infty$. Recurrent mixed states exactly correspond to the causal states $\mathcal{S}$ [30]. When $C_\mu$ diverges, recurrent mixed states often lie on a Cantor set in the simplex; see Fig. 1. In this circumstance, one examines the various dimensions that describe the scaling of such sets. Here, for reasons that will become clear, we use the box-counting dimension $\dim_0(Y)$ and the information dimension $\dim_1(Y)$ [31].

More concretely, we partition the simplex into cubes of side length $\epsilon$, and each nonempty cube is taken to be a predictive feature in our representation. When there are only a finite number of causal states, this feature set consists of the causal states $\mathcal{S}$ for some nonzero $\epsilon$. There is no such $\epsilon$ if, instead, the process has a countable infinity of causal states. Sometimes, as in the finite-$\mathcal{S}$ case, the information dimension and perhaps even the box-counting dimension of the mixed states will vanish. When there is an uncountable infinity of causal states and $\epsilon$ is sufficiently small, the corresponding number of features scales as $|\mathcal{F}| \sim (1/\epsilon)^{\dim_0(Y)}$, where $\dim_0(Y)$ denotes the box-counting dimension [32], and the coding cost scales as $H[R] \sim \dim_1(Y) \log(1/\epsilon)$, where $\dim_1(Y)$ is the information dimension. These dimensions can be approximated numerically; see Fig. 1 and Supplementary Materials App. A.

For processes generated by infinite $\epsilon$-machines, finding the better feature set requires analyzing the scaling of predictive distortion with $\epsilon$. Supplementary Materials App. B shows that $d(R)$ at coarse-graining $\epsilon$ scales at most as $\epsilon$ in the limit of asymptotically small $\epsilon$. Our upper bound relies on the fact that two nearby mixed states have similar future morphs, i.e., they have similar conditional probability distributions over future trajectories given one's mixed state. From this, we conclude that:

$|\mathcal{F}| \lesssim \left(\frac{1}{d}\right)^{\dim_0(Y)}$,   (5)

and

$H[R] \lesssim \dim_1(Y) \log\frac{1}{d}$.   (6)

In general, both $\dim_0(Y)$ and $\dim_1(Y)$ are bounded above by $|\mathcal{G}|$, the number of states in the minimal generative model.

Resource-Prediction Tradeoff. Finally, putting Eqs. (3)-(6) together, we find that for a process with an uncountably infinite $\epsilon$-machine, the necessary number of predictive features $|\mathcal{F}|$ scales no faster than:

$|\mathcal{F}| \sim \left(\frac{1}{d}\right)^{\min(h_0/\lambda,\, \dim_0(Y))}$   (7)

and the requisite coding cost $H[R]$ scales no faster than:

$H[R] \sim \min(h_\mu/\lambda,\, \dim_1(Y)) \log\frac{1}{d}$,   (8)

where $d$ is the predictive distortion.

These bounds have a practical interpretation. From finite data, one can only justify inferring finite-state minimal maximally predictive models. Indeed, the criteria of Ref. [33], applied as in Ref. [13], suggest that the maximum number of inferred states should not yield predictive distortions below the noise in our estimate of the predictive distortion $d$ from $T$ data points. This noise scales as $1/\sqrt{T}$, since from $T$ data points we have approximately $T$ separate measurements of predictive distortion. This, in turn, sets upper bounds of $|\mathcal{F}| \lesssim T^{\frac{1}{2}\min(h_0/\lambda,\, \dim_0(Y))}$ and $H[R] \lesssim \frac{1}{2}\min(h_\mu/\lambda,\, \dim_1(Y)) \log T$ when the process in question has an uncountable infinity of causal states.
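The coarse-grained mixed-state feature set of Eqs. (5) and (6) is equally easy to sketch numerically. Given a sample of mixed states and their weights (however obtained), the following assigns each to an $\epsilon$-cube and reports the feature count $|\mathcal{F}|$ and coding cost $H[R]$; sweeping $\epsilon$ and regressing against $\log(1/\epsilon)$ then estimates $\dim_0(Y)$ and $\dim_1(Y)$. This is a sketch of ours, not the authors' code.

```python
import numpy as np

def coarse_grain_mixed_states(mixed_states, weights, eps):
    """Bin mixed states (rows of an (n, |G|) array, each a distribution over the
    generator's states) into cubes of side eps.  Returns the number of nonempty
    cubes |F| and the coding cost H[R] in nats (entropy of the per-cube weight)."""
    boxes = {}
    for y, w in zip(mixed_states, weights):
        key = tuple(np.floor(y / eps).astype(int))
        boxes[key] = boxes.get(key, 0.0) + w
    p = np.array(list(boxes.values()))
    p = p / p.sum()
    return len(boxes), float(-np.sum(p * np.log(p)))

# Sweeping eps and regressing log|F| on log(1/eps) estimates dim_0(Y);
# regressing H[R] on log(1/eps) estimates dim_1(Y).
```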
We can test this prediction directly using the Bayesian Structural Inference (BSI) algorithm [34] applied to the processes described in Fig. 1. The major difficulty in employing BSI is that one must specify a list of $\epsilon$-machine topologies to search over. Since the number of such topologies grows super-exponentially with the number of states [35], experimentally probing the scaling behavior of the number of inferred states with the amount of available data is, with nothing else said, impossible. However, expert knowledge can cull the number of $\epsilon$-machine topologies that one should search over. In this spirit, we focus on the Simple Nonunifilar Source (SNS), since $\epsilon$-machines for renewal processes have been characterized in detail [36-39]. The SNS's generative HMM is given in Fig. 2 (top). This process has a mixed-state presentation similar to that of nond in Fig. 1(left). We choose to search only over the $\epsilon$-machine topologies that correspond to the class of eventually Poisson renewal processes.

FIG. 1. Mixed-state presentations $Y$ for three different generative models, given in Supplementary Materials App. A. To visualize the mixed-state simplex, the diagrams plot iterates $y_t$ such that the left vertex corresponds to the pure state $\Pr(A, B, C) = (1, 0, 0)$, or HMM state A; the right, to pure state $(0, 1, 0)$, or HMM state B; and the top, to pure state $(0, 0, 1)$, or HMM state C. The left plot shows 55 mixed states in a bin histogram; the middle, 391,484 mixed states; and the right, 1,253,360 mixed states. Bin-cell coloring is the negative logarithm of the normalized bin counts. The box-counting dimensions are, respectively, $\dim_0(Y) \approx 0.3$ at left, $\dim_0(Y) \approx 1.8$ in the middle, and $\dim_0(Y) \approx 1.9$ at right. The box-counting dimension is calculated by estimating the slope of $\log N_\epsilon$ versus $\log(1/\epsilon)$, as described in Supplementary Materials App. A.

We must first revisit the upper bound in Eq. (7), as the SNS has a countable (not uncountable) infinity of causal states. When a process's $\epsilon$-machine is countable instead of uncountable, we can improve upon Eq. (7). However, the magnitude of improvement is process dependent and not easily characterized in general, for two main reasons. First, predictive distortion can decrease faster than $\epsilon$ for processes generated by countably infinite $\epsilon$-machines, since there might be only one mixed state in an $\epsilon$-hypercube. (See the lefthand factor in Supplementary Materials Eq. (B5), $\sum_{r \in \mathcal{R}^{(\epsilon)}: H[Y \mid R^{(\epsilon)} = r] \neq 0} \pi(r)$. We are guaranteed that $\lim_{\epsilon \to 0} \sum_{r \in \mathcal{R}^{(\epsilon)}: H[Y \mid R^{(\epsilon)} = r] \neq 0} \pi(r) = 0$, but the rate of convergence to zero is highly process dependent.) Second, the scaling relation between $N_\epsilon$ and $1/\epsilon$ may be sub-power-law. Given a specific process with a countably infinite $\epsilon$-machine, though (here, the SNS), we can derive expected scaling relations; see Supplementary Materials App. C. We argue that, roughly speaking, we should expect predictive distortion to decay exponentially with $N_\epsilon$, so that $d(R_\epsilon) \sim \lambda^{N_\epsilon}$ for some $\lambda < 1$. Since our uncertainty in $d$ scales as $1/\sqrt{T}$, where $T$ is the amount of data, we expect that the number of inferred states $N_T$ should scale as $\log T$.

These scaling relationships are confirmed by BSI applied to data generated from the SNS, where the set of $\epsilon$-machine topologies selected from are the eventually Poisson $\epsilon$-machine topologies characterized by Ref. [36]. Given a set of machines $\mathcal{M}$ and data $x_{0:T}$, BSI returns an easily calculable posterior $\Pr(M \mid x_{0:T})$ for $M \in \mathcal{M}$ with (at least) two hyperparameters: (i) the concentration parameter $\alpha$ for our Dirichlet prior on the transition probabilities and (ii) our prior on the likelihood of a model $M$ with number of states $|M|$, taken to be proportional to $e^{-\beta |M|}$ for a user-specified $\beta$. From this posterior, we calculate an average model size $\langle |M| \rangle(T) = \sum_{M \in \mathcal{M}} |M| \Pr(M \mid x_{0:T})$ as a function of $T$ for multiple data strings $x_{0:T}$. The result is that the scaling of $\langle |M| \rangle(T)$ with $T$ is proportional to $\log T$ in the large-$T$ limit, where the proportionality constant and the initial value depend on the hyperparameters $\alpha$ and $\beta$.
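The BSI evidence calculation itself is given in Ref. [34]; the model-averaging step used above is simple enough to sketch. In the following, `log_evidence` is assumed to hold per-topology log marginal likelihoods computed elsewhere (a hypothetical input), and the size prior $\propto e^{-\beta|M|}$ is combined with it to produce $\langle |M| \rangle$.

```python
import numpy as np

def posterior_average_size(log_evidence, sizes, beta):
    """Combine per-topology log evidences (assumed computed elsewhere, e.g. by BSI's
    Dirichlet marginal likelihood with concentration alpha) with the size prior
    Pr(M) ~ exp(-beta * |M|), and return the posterior-average state count <|M|>."""
    log_post = np.asarray(log_evidence, dtype=float) - beta * np.asarray(sizes, dtype=float)
    log_post -= log_post.max()            # stabilize before exponentiating
    post = np.exp(log_post)
    post /= post.sum()
    return float(np.dot(post, sizes))

# Repeating this for growing data lengths T and plotting <|M|>(T) against log T
# probes the N_T ~ log T scaling discussed above.
```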
Conclusion. We now better understand the tradeoff between memory and prediction in processes generated by a finite-state hidden Markov model with infinite statistical complexity. Importantly, and perhaps still underappreciated, such processes are more typical than the finite-statistical-complexity and Gaussian processes that have received more attention in the related literature [12, 13]. We proposed a new method for obtaining predictive features, coarse-graining the mixed states, and used this feature set to bound the number of features and their coding cost in the limit of small predictive distortion. These bounds were compared to those obtained from the more familiar and widely used (Markov) predictive feature set, that of memorizing all pasts of a fixed length. The general results here suggest that the new feature set can outperform the standard set in many situations.

Practically speaking, the new bounds presented have three potential uses. First, they give weight to Ref. [7]'s suggestion to replace the statistical complexity with the information dimension of mixed-state space as a complexity measure when the statistical complexity diverges.

FIG. 2. Model-size scaling for the parametrized Simple Nonunifilar Source (SNS). (Top) Generative HMM for the SNS, with transition probabilities p and q. (Bottom) Scaling of $\langle |M| \rangle = \sum_{M \in \mathcal{M}} |M| \Pr(M \mid x_{0:T})$ with $T$, for $\beta = 1, 4, 8$, with the posterior $\Pr(M \mid x_{0:T})$ calculated using formulae from BSI [34] with $\alpha = 1$ and the set of model topologies being the eventually Poisson $\epsilon$-machines described in Ref. [36]. Data were generated from the SNS with $p = q = 1/2$. Linear regression reveals a slope of 1 for $\log\log T$ versus $\log \langle |M| \rangle$, confirming the expected $N_T \sim \log T$, where the proportionality constants depend on $\beta$.

Second, they suggest a route to an improved, process-dependent tradeoff between complexity and precision for estimating the entropy rate of processes generated by nonunifilar HMMs [40]. (Admittedly, space kept us from addressing here how to estimate the probability distribution over $\epsilon$-boxes.) Finally, and perhaps most importantly, our upper bounds provide a first attempt to calculate the expected scaling of inferred model size with available data.

It is commonly accepted that inferring time-series models from finite data typically has two components, parameter estimation and model selection, though these two can be done simultaneously. Our focus above was on model selection, as we monitored how model size increased with the amount of available data. One can view the results as an effort to characterize the posterior distribution of $\epsilon$-machine topologies given data or as the growth rate of the optimum model cost [41] in the asymptotic limit. Posterior distributions of estimated parameters (transition probabilities) are almost always asymptotically normal, with standard deviation decreasing as the square root of the amount of available data [42-44]. However, asymptotic normality does not typically hold for this posterior distribution, since using finite $\epsilon$-machine topologies almost always implies out-of-class modeling. Rather, we showed that the particular way in which the mode of this posterior distribution increases with data typically depends on a gross process statistic: the box-counting dimension of the mixed-state presentation.

Stepping back, we only tackled the lower parts of the infinitary-process hierarchy identified in Ref. [45]. In particular, complex processes in the sense of Ref. [46] have different resource-prediction tradeoffs than analyzed here, since processes generated by finite-state HMMs (as assumed here) cannot produce processes with infinite excess entropy. To do so, at the very least, predictive distortion must be more carefully defined. We conjecture that success will be achieved by instead focusing on a one-step predictive distortion, equivalent to a self-information loss function, as is typically done [41, 47]. Luckily, the derivation in Supplementary Materials App. B easily extends to this case, simultaneously suggesting improvements to related entropy-rate-approximating algorithms. We hope these introductory results inspire future study of resource-prediction tradeoffs for infinitary processes.

The authors thank the Santa Fe Institute for its hospitality during visits and A. Boyd, C. Hillar, and D. Upper for useful discussions. JPC is an SFI External Faculty member. This material is based upon work supported by, or in part by, the U.S. Army Research Laboratory and the U.S. Army Research Office under contracts W911NF and W911NF. S.E.M. was funded by a National Science Foundation Graduate Student Research Fellowship, a U.C. Berkeley Chancellor's Fellowship, and the MIT Physics of Living Systems Fellowship.

[1] S. Still. EuroPhys. Lett., 85:28005, 2009.
[2] M. L. Littman, R. S. Sutton, and S. P. Singh. In NIPS, volume 14, 2001.
[3] N. Tishby and D. Polani. In Perception-Action Cycle, Springer, 2011.
[4] D. Little and F. Sommer. Learning and exploration in action-perception loops. Frontiers in Neural Circuits, 7:37, 2013.
[5] S. Still and D. Precup. Theory in Biosciences, 131(3), 2012.
[6] N. Brodu. Adv. Complex Sys., 14(05), 2011.
[7] J. P. Crutchfield. Physica D, 75:11-54, 1994.
[8] S. Still and J. P. Crutchfield. arXiv:0708.0654.
[9] T. Berger. Rate Distortion Theory. Prentice-Hall, New York, 1971.
[10] J. P. Crutchfield and K. Young. Phys. Rev. Lett., 63:105-108, 1989.
[11] C. R. Shalizi and J. P. Crutchfield. J. Stat. Phys., 104:817-879, 2001.
[12] F. Creutzig, A. Globerson, and N. Tishby. Phys. Rev. E, 79(4):041925, 2009.
[13] S. Still, J. P. Crutchfield, and C. J. Ellison. CHAOS, 20(3):037111, 2010.
[14] S. Marzen and J. P. Crutchfield. J. Stat. Phys., 163(6), 2014.
[15] D. Blackwell. In Transactions of the First Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, pages 13-20, 1957. Held at Liblice near Prague, November 28 to 30, 1956.
[16] L. R. Rabiner and B. H. Juang. IEEE ASSP Magazine, January 1986.
[17] E. J. Gilbert. Ann. Math. Stat., 30:688-697, 1959.
[18] H. Ito, S.-I. Amari, and K. Kobayashi. IEEE Info. Th., 38:324, 1992.
[19] V. Balasubramanian. Neural Computation, 9:349-368, 1997.
[20] D. R. Upper. Theory and Algorithms for Hidden Markov Models and Generalized Hidden Markov Models. PhD thesis, University of California, Berkeley, 1997. Published by University Microfilms Intl, Ann Arbor, Michigan.
[21] This is not necessarily the same as the generative model with minimal generative complexity [48], since a nonunifilar generator with a large number of sparsely used states might have smaller state entropy than a nonunifilar generator with a small number of equally used states. And, typically, per our main result, the minimal generative model is not the same as the minimal prescient (unifilar) model.
[22] Cf. Ref. [23]'s discussion of terminology and Ref. [46].
[23] J. P. Crutchfield and D. P. Feldman. CHAOS, 13(1):25-54, 2003.
[24] Finite HMMs generate finitary processes, as E is bounded from above by the logarithm of the number of HMM states [48].
[25] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, second edition, 2006.
[26] N. Travers and J. P. Crutchfield. Equivalence of history and generator ε-machines. arXiv:1111.4500.
[27] One can construct a better feature set simply by applying the causal-state equivalence relation to these length-L pasts. We avoid this here since the size of that feature set is more difficult to analyze generically and since the causal-state equivalence relation applied to length-L pasts often induces no coarse-graining.
[28] P. M. Ara, R. G. James, and J. P. Crutchfield. Phys. Rev. E, 93(2):022143, 2016.
[29] N. F. Travers. Stochastic Proc. Appln., 124(1), 2014.
[30] C. J. Ellison, J. R. Mahoney, and J. P. Crutchfield. J. Stat. Phys., 136(6):1005-1034, 2009.
[31] Ya. B. Pesin. Dimension Theory in Dynamical Systems. University of Chicago Press, 1997.
[32] We assume here that the lower box-counting dimension and upper box-counting dimension are equivalent.
[33] S. Still and W. Bialek. Neural Comp., 16(12):2483-2506, 2004.
[34] C. C. Strelioff and J. P. Crutchfield. Phys. Rev. E, 89:042119, 2014.
[35] B. D. Johnson, J. P. Crutchfield, C. J. Ellison, and C. S. McTague. arXiv:1011.0036.
[36] S. Marzen and J. P. Crutchfield. Entropy, 17(7):4891-4917, 2015.
[37] S. Marzen and J. P. Crutchfield. Phys. Lett. A, 380(17), 2016.
[38] S. Marzen, M. R. DeWeese, and J. P. Crutchfield. Front. Comput. Neurosci., 9:109, 2015.
[39] S. Marzen and J. P. Crutchfield. arXiv preprint.
[40] E. Ordentlich and T. Weissman. In Entropy of Hidden Markov Processes and Connections to Dynamical Systems, London Math. Soc. Lecture Notes, 385, 2011.
[41] J. Rissanen. IEEE Trans. Info. Th., IT-30:629, 1984.
[42] A. M. Walker. J. Roy. Stat. Soc. Series B, pages 80-88, 1969.
[43] C. C. Heyde and I. M. Johnstone. J. Roy. Stat. Soc. Series B, pages 184-189, 1979.
[44] T. J. Sweeting. Bayesian Statistics, 4:825-835, 1992.
[45] J. P. Crutchfield and S. Marzen. Phys. Rev. E, 91(5):050106, 2015.
[46] W. Bialek, I. Nemenman, and N. Tishby. Neural Comp., 13:2409-2463, 2001.
[47] N. Merhav and M. Feder. IEEE Trans. Info. Th., 44(6):2124-2147, 1998.
[48] W. Löhr. Models of discrete-time stochastic processes and associated complexity measures. PhD thesis, University of Leipzig, May 2009.
[49] S.-W. Ho and R. W. Yeung. The interplay between entropy and variational distance. IEEE Trans. Info. Th., 56(12):5906-5929, 2010.
[50] T. Tao. An Introduction to Measure Theory, volume 126. American Mathematical Society, 2011.
[51] This upper bound on E was noted in Ref. [48].

Supplementary Materials for "Nearly Maximally Predictive Features and Their Dimensions"
Sarah E. Marzen and James P. Crutchfield

Appendix A: Estimating fractal dimensions

We start by describing the generative models whose mixed-state presentations are shown in Fig. 1. Figure 1(left) shows the mixed-state presentation of the nond process, whose labeled transition matrices $T^{(0)}$ and $T^{(1)}$ are those of the three-state minimal generative model drawn in Fig. 3(left). Figure 1(middle) and Fig. 1(right) show the mess3 process, whose labeled transition matrices are parametrized by $x$ and $\alpha$:

$T^{(0)} = \begin{pmatrix} \alpha y & \beta x & \beta x \\ \alpha x & \beta y & \beta x \\ \alpha x & \beta x & \beta y \end{pmatrix}, \quad T^{(1)} = \begin{pmatrix} \beta y & \alpha x & \beta x \\ \beta x & \alpha y & \beta x \\ \beta x & \alpha x & \beta y \end{pmatrix}, \quad T^{(2)} = \begin{pmatrix} \beta y & \beta x & \alpha x \\ \beta x & \beta y & \alpha x \\ \beta x & \beta x & \alpha y \end{pmatrix},$

with $\beta = (1-\alpha)/2$ and $y = 1 - 2x$. Figure 1(middle) corresponds to $x = 0.15$ and $\alpha = 0.6$, while Fig. 1(right) corresponds to $x = 0.05$ and $\alpha = 0.6$. The probabilities of emitting a 0, 1, and 2 given the current state are, respectively:

$p(x \mid A) = \begin{cases} \alpha & x = 0 \\ \frac{1-\alpha}{2} & x = 1, 2 \end{cases}, \quad p(x \mid B) = \begin{cases} \alpha & x = 1 \\ \frac{1-\alpha}{2} & x = 0, 2 \end{cases}, \quad \text{and} \quad p(x \mid C) = \begin{cases} \alpha & x = 2 \\ \frac{1-\alpha}{2} & x = 0, 1. \end{cases}$

FIG. 3. (Left) Minimal generative model for the nond process, whose mixed-state presentation is shown in Fig. 1(left). (Right) Parametrized minimal generative model for the mess3 process, with self-transition probability $1 - 2x$ and probability $x$ of hopping to each of the other two states. For Fig. 1(middle), $x = 0.15$ and $\alpha = 0.6$; for Fig. 1(right), $x = 0.05$ and $\alpha = 0.6$.

The box-counting dimension is approximated by forming the mixed-state presentation from the generative model, as described by Ref. [30], down to the desired tree depth. For Fig. 1(left), the depth was sufficient to accrue 10,000 mixed states; for Fig. 1(middle) and Fig. 1(right), the depth was sufficient to accrue 30,000 mixed states. The box-counting dimension was estimated by partitioning the simplex into boxes of side length $\epsilon = 1/n$ for $n \in \{1, \ldots, 300\}$, calculating the number $N_\epsilon$ of nonempty boxes, and using linear regression on $\log N_\epsilon$ versus $\log(1/\epsilon)$ to find the slope. The information dimension could be estimated as well by estimating the probabilities of the words that lead to each mixed state.
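For reproducibility of these estimates, here is a minimal Python sketch (ours) that enumerates mess3 mixed states by the standard Blackwell recursion, using the labeled matrices as given above; the tree depth and the grid of box sizes are illustrative choices.

```python
import numpy as np

def mess3_matrices(x=0.15, a=0.6):
    """Labeled transition matrices T^(s)[i, j] = Pr(G_{t+1}=j, X_{t+1}=s | G_t=i)
    for mess3, with b = (1 - a)/2 and y = 1 - 2x as in the text."""
    b, y = (1 - a) / 2, 1 - 2 * x
    P = np.full((3, 3), x) + (y - x) * np.eye(3)           # state-to-state transitions
    emit = [np.array([a if j == s else b for j in range(3)]) for s in range(3)]
    return [P * emit[s] for s in range(3)]                 # symbol s emitted from the destination state

def mixed_states(T, depth):
    """Enumerate mixed states via the Blackwell recursion down to a given tree depth."""
    n = T[0].shape[0]
    evals, evecs = np.linalg.eig(sum(T).T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()                                     # stationary state distribution
    states, frontier = [pi], [(pi, 1.0)]
    for _ in range(depth):
        nxt = []
        for y, w in frontier:
            for Ts in T:
                ps = float(y @ Ts @ np.ones(n))            # Pr(next symbol | mixed state y)
                if ps > 1e-12:
                    nxt.append((y @ Ts / ps, w * ps))
        frontier = nxt
        states.extend(y for y, _ in frontier)
    return np.array(states)

def box_dimension(states, grid=range(10, 300, 10)):
    """Estimate dim_0(Y): regress log N_eps on log(1/eps) over box sizes eps = 1/n."""
    pts = [(np.log(n), np.log(len({tuple(np.floor(y * n).astype(int)) for y in states})))
           for n in grid]
    X, Y = np.array(pts).T
    return np.polyfit(X, Y, 1)[0]

Y = mixed_states(mess3_matrices(x=0.15, a=0.6), depth=9)   # about 30,000 mixed states
print("estimated dim_0(Y):", box_dimension(Y))
```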

Appendix B: Predictive distortion scaling for coarse-grained mixed states

Recall that the second feature set (alphabet $\mathcal{F}$, random variable $R$) is constructed by partitioning the simplex into boxes of side length $\epsilon$. This section finds the scaling of $d(R)$ with $\epsilon$ in the limit of asymptotically small $\epsilon$. For reasons that become clear shortly, we define:

$d_L(R) = I[X_{-\infty:0}; \vec{X}^L \mid R]$   (B1)
$\;\;\;\;\;\;\;\;\;\;\;\; = H[\vec{X}^L \mid R] - H[\vec{X}^L \mid X_{-\infty:0}]$,   (B2)

where $\lim_{L\to\infty} d_L(R) = d(R)$ and $\vec{X}^L \equiv X_{0:L}$. Let $\pi(r)$ be the invariant probability distribution over $\epsilon$-boxes, and let $\pi(y \mid r)$ be the probability measure over mixed states $y$ in the $\epsilon$-box $r$. Then:

$d_L(R) = \sum_r \pi(r) \left( H[\vec{X}^L \mid R = r] - \int d\pi(y \mid r)\, H[\vec{X}^L \mid Y = y] \right)$,   (B3)

where:

$\Pr(\vec{X}^L \mid R = r) = \int d\pi(y \mid r)\, \Pr(\vec{X}^L \mid Y = y)$.   (B4)

To place an upper bound on $d_L(R)$ in terms of $\epsilon$, we note that any two mixed states $y$ and $y'$ in the same partition element satisfy $\|y - y'\|_1 \leq |\mathcal{G}|\,\epsilon$, the ($\ell_1$) length of the longest diagonal of a hypercube of side $\epsilon$ in dimension $|\mathcal{G}|$, by construction. Hence:

$\left\| \Pr(\vec{X}^L \mid Y = y) - \Pr(\vec{X}^L \mid Y = y') \right\|_{TV} = \left\| \sum_{i=1}^{|\mathcal{G}|} \Pr(\vec{X}^L \mid \mathcal{G}_0 = i)\,(y_i - y'_i) \right\|_{TV} \leq \sum_{i=1}^{|\mathcal{G}|} \left\| \Pr(\vec{X}^L \mid \mathcal{G}_0 = i) \right\|_{TV} |y_i - y'_i| = \sum_{i=1}^{|\mathcal{G}|} |y_i - y'_i|.$

From the earlier observation, though, this is simply:

$\left\| \Pr(\vec{X}^L \mid Y = y) - \Pr(\vec{X}^L \mid Y = y') \right\|_{TV} \leq |\mathcal{G}|\,\epsilon.$

(Often the literature defines $\|\cdot\|_{TV}$ as 1/2 of the quantity used here.) From this, we can similarly conclude that:

$\left\| \Pr(\vec{X}^L \mid R = r) - \Pr(\vec{X}^L \mid Y = y) \right\|_{TV} = \left\| \int d\pi(y' \mid r) \left( \Pr(\vec{X}^L \mid Y = y') - \Pr(\vec{X}^L \mid Y = y) \right) \right\|_{TV} \leq \int d\pi(y' \mid r) \left\| \Pr(\vec{X}^L \mid Y = y') - \Pr(\vec{X}^L \mid Y = y) \right\|_{TV} \leq \int d\pi(y' \mid r)\, |\mathcal{G}|\,\epsilon = |\mathcal{G}|\,\epsilon$

for any mixed state $y$ in partition $r$. Reference [49] gives an upper bound on differences in entropy in terms of the total variation distance.
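A quick numerical sanity check of this total-variation bound is possible with the mess3 helper from the App. A sketch above; the particular mixed states, word length, and comparison below are arbitrary choices of ours.

```python
import numpy as np
from itertools import product

def word_distribution(T, y, L):
    """Pr(x_{0:L} | mixed state y): push y through the labeled matrices along each word."""
    n = T[0].shape[0]
    return {w: float(np.linalg.multi_dot([y] + [T[s] for s in w]) @ np.ones(n))
            for w in product(range(len(T)), repeat=L)}

# Two mixed states a small distance apart should have future word distributions
# within ||y - y'||_1 (itself at most |G|*eps for states in one eps-box) in
# unhalved total variation.  Uses mess3_matrices() from the App. A sketch.
T = mess3_matrices()
y1 = np.array([0.4, 0.3, 0.3])
y2 = y1 + np.array([0.01, -0.005, -0.005])
P1, P2 = word_distribution(T, y1, 5), word_distribution(T, y2, 5)
tv = sum(abs(P1[w] - P2[w]) for w in P1)
print(tv, np.abs(y1 - y2).sum())          # the first number should not exceed the second
```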

Applied here, that bound implies:

$\left| H[\vec{X}^L \mid R = r] - H[\vec{X}^L \mid Y = y] \right| \leq H_b\!\left(\frac{|\mathcal{G}|\,\epsilon}{2}\right) + \frac{|\mathcal{G}|\,\epsilon}{2}\, L \log|\mathcal{A}|.$

This, in turn, gives:

$\left| H[\vec{X}^L \mid R = r] - \int d\pi(y \mid r)\, H[\vec{X}^L \mid Y = y] \right| \leq \int d\pi(y \mid r) \left| H[\vec{X}^L \mid R = r] - H[\vec{X}^L \mid Y = y] \right| \leq H_b\!\left(\frac{|\mathcal{G}|\,\epsilon}{2}\right) + \frac{|\mathcal{G}|\,\epsilon}{2}\, L \log|\mathcal{A}|.$

That said, if there is only one mixed state in some $\epsilon$-cube $r$, the quantity above vanishes:

$H[\vec{X}^L \mid R = r] - \int d\pi(y \mid r)\, H[\vec{X}^L \mid Y = y] = 0.$

Denote the probability of a nonempty $\epsilon$-cube having only one mixed state in it as $\sum_{r \in \mathcal{F}: H[Y \mid R = r] = 0} \pi(r)$. Then substitution into Eq. (B3) gives:

$d_L(R) \leq \left[ \sum_{r \in \mathcal{F}: H[Y \mid R = r] \neq 0} \pi(r) \right] \left( H_b\!\left(\frac{|\mathcal{G}|\,\epsilon}{2}\right) + \frac{|\mathcal{G}|\,\epsilon}{2}\, L \log|\mathcal{A}| \right).$   (B5)

Note that this implies the entropy rate can be approximated with error no greater than $H_b\!\left(\frac{|\mathcal{G}|\,\epsilon}{2}\right) + \frac{|\mathcal{G}|\,\epsilon}{2} \log|\mathcal{A}|$ if one knows the distribution $\pi(r)$ over $\epsilon$-boxes.

The left factor, $\sum_{r \in \mathcal{F}: H[Y \mid R = r] \neq 0} \pi(r)$, tends to zero as the partitions shrink for a process generated by a countable $\epsilon$-machine and not so otherwise. For ease of presentation, we first study the case of uncountably infinite $\epsilon$-machines and then comment on the case of processes produced by countable $\epsilon$-machines. For clarity, we now introduce the notation $R^{(\epsilon)}$ to indicate the predictive features obtained from an $\epsilon$-partition of the mixed-state simplex. From Eq. (B5), we can readily conclude that for any finite $L$:

$\lim_{\epsilon \to 0} \frac{d_L(R^{(\epsilon)})}{\epsilon^\gamma} = 0$, when $\gamma < 1$.

We would like to extend this conclusion to $d(R^{(\epsilon)})$, but inspection of Eq. (B5) suggests that concluding anything similar for $d(R^{(\epsilon)})$ requires care. In particular, we would like to show that $\lim_{\epsilon \to 0} d(R^{(\epsilon)})/\epsilon^\gamma = 0$ for any $\gamma < 1$. Recall that $d(R) = \lim_{L \to \infty} d_L(R)$. If we can exchange the limits $\lim_{\epsilon \to 0}$ and $\lim_{L \to \infty}$, then:

$\lim_{\epsilon \to 0} \frac{d(R^{(\epsilon)})}{\epsilon^\gamma} = \lim_{\epsilon \to 0} \lim_{L \to \infty} \frac{d_L(R^{(\epsilon)})}{\epsilon^\gamma} = \lim_{L \to \infty} \lim_{\epsilon \to 0} \frac{d_L(R^{(\epsilon)})}{\epsilon^\gamma} = 0.$

To justify exchanging limits, we appeal to the Moore-Osgood Theorem [50]. Consider any monotone decreasing sequence $\{\epsilon_i\}_{i=1}^\infty$, so that $a_{i,L} := d_L(R^{(\epsilon_i)})/\epsilon_i^\gamma$ is a doubly infinite sequence. When $\gamma < 1$, we have $\lim_{i \to \infty} a_{i,L} = 0$. We provide a plausibility argument for uniform convergence of $a_{i,L}$ to 0. Recall from Eq. (B5) that:

$|a_{i,L} - 0| = a_{i,L} \leq \frac{1}{\epsilon_i^\gamma} \left( H_b\!\left(\frac{|\mathcal{G}|\,\epsilon_i}{2}\right) + \frac{|\mathcal{G}|\,\epsilon_i}{2}\, L \log|\mathcal{A}| \right).$

And, since $H_b(x) \leq x \log(1/x) + x$, this expands to:

$a_{i,L} \leq \frac{|\mathcal{G}|}{2}\, \epsilon_i^{1-\gamma} \log(1/\epsilon_i) + \left( \frac{|\mathcal{G}|}{2} \log\frac{2}{|\mathcal{G}|} + \frac{|\mathcal{G}|}{2} + \frac{|\mathcal{G}|}{2}\, L \log|\mathcal{A}| \right) \epsilon_i^{1-\gamma}.$

At small enough $\epsilon_i$, the first term dominates, implying that $a_{i,L} \lesssim \frac{|\mathcal{G}|}{2}\, \epsilon_i^{1-\gamma} \log\frac{1}{\epsilon_i}$. By making $i$ sufficiently large (i.e., by making $\epsilon_i$ sufficiently small), we can make this upper bound as small as desired. Finally, the Data Processing Inequality implies that $I[X_{-\infty:0}; \vec{X}^L \mid R]$ is monotone increasing in $L$ and upper bounded by $\mathbf{E} \leq \log|\mathcal{G}| < \infty$ [51]. And so, $a_{i,L}$ is also monotone increasing in $L$ and upper bounded by $\mathbf{E}/\epsilon_i^\gamma$. Hence, the monotone convergence theorem implies that $\lim_{L\to\infty} a_{i,L}$ exists for any $i$. Therefore, the conditions of the Moore-Osgood Theorem are satisfied and $\lim_{i\to\infty} \lim_{L\to\infty} d_L(R^{(\epsilon_i)})/\epsilon_i^\gamma = 0$, so $\lim_{\epsilon\to 0} d(R^{(\epsilon)})/\epsilon^\gamma = 0$ for any $\gamma < 1$, as desired. Loosely speaking, as in the main text, we say simply that $d(R^{(\epsilon)})$ scales as $\epsilon$ for small $\epsilon$.

Appendix C: Countably infinite ε-machine example: the Simple Nonunifilar Source

Consider the parametrized Simple Nonunifilar Source (SNS), whose generative HMM is shown in Fig. 2(top) and whose mixed-state scaling is summarized in Fig. 4. It is straightforward to show that the mixed states lie on a line in the simplex, $\{(v_n, 1 - v_n)\}_{n=0}^\infty$, with:

$v_n = \frac{(1-p)^n (q - p)}{(1-p)^n q - (1-q)^n p}.$

If $p < q$, then $\lim_{n\to\infty} v_n = 1 - p/q$; if $q < p$, then $\lim_{n\to\infty} v_n = 0$. It is also straightforward to show that $v_n$ converges exponentially fast to its limit when $p \neq q$:

$v_n - \lim_{n\to\infty} v_n \approx \begin{cases} \left(1 - \frac{p}{q}\right) \frac{p}{q} \left(\frac{1-q}{1-p}\right)^n & p < q \\[4pt] \left(1 - \frac{q}{p}\right) \left(\frac{1-p}{1-q}\right)^n & q < p. \end{cases}$

However, when $p = q$, then $v_n = 1 - \frac{\frac{1}{q} - 1}{n + \left(\frac{1}{p} - 1\right)}$, with $\lim_{n\to\infty} v_n = 1$; for any $p$, then, $v_n - \lim_{n\to\infty} v_n \sim \left(\frac{1}{p} - 1\right)\frac{1}{n}$. This is a much slower rate of convergence, and it greatly affects the box-counting dimension of the parametrized SNS's mixed-state presentation. In fact, $\dim_0(Y)$ is nonzero (and roughly 1/2) only when $p = q$; see Fig. 4.

Additionally, for the parametrized SNS, the causal states act as a counter of the number of 0's since the last 1. So, we can think of the predictive distortion at coarse-graining $\epsilon$ as $d(R_\epsilon) \approx \mathbf{E} - E(N_\epsilon)$, which decreases exponentially quickly with $N_\epsilon$, rather than algebraically, under weak conditions. (See, for example, Ref. [29] and references therein.) Alternatively, from Eq. (B5), we note that the tail probability $\sum_{i=n}^{\infty} \pi(i)$ decreases exponentially with $n$ for the parametrized SNS; see Ref. [36] for details. From Fig. 4, we see that:

$N_\epsilon \sim \begin{cases} (1/\epsilon)^{1/2} & p = q \\[4pt] \left(\log\frac{1}{\epsilon}\right)^2 & p \neq q. \end{cases}$
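As a numerical illustration (ours, using the closed forms quoted above as reconstructed), one can tabulate $N_\epsilon$ directly from the $v_n$:

```python
import numpy as np

def sns_mixed_states(p, q, n_max):
    """Mixed-state coordinates v_n of the parametrized SNS (closed forms from App. C)."""
    n = np.arange(n_max, dtype=float)
    if p == q:
        return 1.0 - (1.0 / q - 1.0) / (n + (1.0 / p - 1.0))
    r = (1.0 - q) / (1.0 - p)
    return (q - p) / (q - p * r**n)      # numerically stabler, equivalent form of v_n

def n_eps(v, eps):
    """Number of nonempty eps-cells covering the mixed states, which lie on a line."""
    return len(set(np.floor(v / eps).astype(int)))

for p, q in [(0.5, 0.5), (0.5, 0.4)]:
    v = sns_mixed_states(p, q, n_max=2000)
    print(p, q, [n_eps(v, eps) for eps in (1e-2, 1e-3, 1e-4)])
```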

FIG. 4. Feature-set scaling of SNS mixed states: number $N_\epsilon$ of nonempty $\epsilon$-cubes as a function of $1/\epsilon$ on a log-log plot, for $p = q = 0.5$ and for $q = 0.4 < p = 0.5$. When $p = q$, the scaling relation appears to be a power law: $N_\epsilon \sim (1/\epsilon)^{1/2}$. When $p \neq q$, $N_\epsilon$ scales more slowly than a power law in $1/\epsilon$, seemingly at a rate $N_\epsilon \sim \left(\log\frac{1}{\epsilon}\right)^2$.


More information

Identification of Nonlinear Dynamic Systems with Multiple Inputs and Single Output using discrete-time Volterra Type Equations

Identification of Nonlinear Dynamic Systems with Multiple Inputs and Single Output using discrete-time Volterra Type Equations Identification of Nonlinear Dnamic Sstems with Multiple Inputs and Single Output using discrete-time Volterra Tpe Equations Thomas Treichl, Stefan Hofmann, Dierk Schröder Institute for Electrical Drive

More information

Learning the Linear Dynamical System with ASOS ( Approximated Second-Order Statistics )

Learning the Linear Dynamical System with ASOS ( Approximated Second-Order Statistics ) Learning the Linear Dynamical System with ASOS ( Approximated Second-Order Statistics ) James Martens University of Toronto June 24, 2010 Computer Science UNIVERSITY OF TORONTO James Martens (U of T) Learning

More information

Orientations of digraphs almost preserving diameter

Orientations of digraphs almost preserving diameter Orientations of digraphs almost preserving diameter Gregor Gutin Department of Computer Science Roal Hollowa, Universit of London Egham, Surre TW20 0EX, UK, gutin@dcs.rhbnc.ac.uk Anders Yeo BRICS, Department

More information

THE UNIVERSITY OF MICHIGAN. Timing Analysis of Domino Logic

THE UNIVERSITY OF MICHIGAN. Timing Analysis of Domino Logic Timing Analsis of Domino Logic b David Van Campenhout, Trevor Mudge and Karem Sakallah CSE-T-296-96 THE UNIVESITY O MICHIGAN Computer Science and Engineering Division Department of Electrical Engineering

More information

LESSON #12 - FORMS OF A LINE COMMON CORE ALGEBRA II

LESSON #12 - FORMS OF A LINE COMMON CORE ALGEBRA II LESSON # - FORMS OF A LINE COMMON CORE ALGEBRA II Linear functions come in a variet of forms. The two shown below have been introduced in Common Core Algebra I and Common Core Geometr. TWO COMMON FORMS

More information

Symmetry Arguments and the Role They Play in Using Gauss Law

Symmetry Arguments and the Role They Play in Using Gauss Law Smmetr Arguments and the Role The la in Using Gauss Law K. M. Westerberg (9/2005) Smmetr plas a ver important role in science in general, and phsics in particular. Arguments based on smmetr can often simplif

More information

CCSSM Algebra: Equations

CCSSM Algebra: Equations CCSSM Algebra: Equations. Reasoning with Equations and Inequalities (A-REI) Eplain each step in solving a simple equation as following from the equalit of numbers asserted at the previous step, starting

More information

Nonlinear regression

Nonlinear regression 7//6 Nonlinear regression What is nonlinear model? In general nonlinearit in regression model can be seen as:. Nonlinearit of dependent variable ( ) for eample is binar (logit model), nominal (multivariate

More information

Information Theory. Mark van Rossum. January 24, School of Informatics, University of Edinburgh 1 / 35

Information Theory. Mark van Rossum. January 24, School of Informatics, University of Edinburgh 1 / 35 1 / 35 Information Theory Mark van Rossum School of Informatics, University of Edinburgh January 24, 2018 0 Version: January 24, 2018 Why information theory 2 / 35 Understanding the neural code. Encoding

More information

Trusses - Method of Sections

Trusses - Method of Sections Trusses - Method of Sections ME 202 Methods of Truss Analsis Method of joints (previous notes) Method of sections (these notes) 2 MOS - Concepts Separate the structure into two parts (sections) b cutting

More information

1/f spectral trend and frequency power law of lossy media

1/f spectral trend and frequency power law of lossy media 1/f spectral trend and frequenc power law of loss media W. Chen Simula Research Laborator, P. O. Box. 134, 135 Lsaker, Norwa (5 Ma 3) The dissipation of acoustic wave propagation has long been found to

More information

Using Expectation-Maximization for Reinforcement Learning

Using Expectation-Maximization for Reinforcement Learning NOTE Communicated by Andrew Barto and Michael Jordan Using Expectation-Maximization for Reinforcement Learning Peter Dayan Department of Brain and Cognitive Sciences, Center for Biological and Computational

More information

EQUIVALENT FORMULATIONS OF HYPERCONTRACTIVITY USING INFORMATION MEASURES

EQUIVALENT FORMULATIONS OF HYPERCONTRACTIVITY USING INFORMATION MEASURES 1 EQUIVALENT FORMULATIONS OF HYPERCONTRACTIVITY USING INFORMATION MEASURES CHANDRA NAIR 1 1 Dept. of Information Engineering, The Chinese Universit of Hong Kong Abstract We derive alternate characterizations

More information

Spotlight on the Extended Method of Frobenius

Spotlight on the Extended Method of Frobenius 113 Spotlight on the Extended Method of Frobenius See Sections 11.1 and 11.2 for the model of an aging spring. Reference: Section 11.4 and SPOTLIGHT ON BESSEL FUNCTIONS. Bessel functions of the first kind

More information

MAT 127: Calculus C, Fall 2010 Solutions to Midterm I

MAT 127: Calculus C, Fall 2010 Solutions to Midterm I MAT 7: Calculus C, Fall 00 Solutions to Midterm I Problem (0pts) Consider the four differential equations for = (): (a) = ( + ) (b) = ( + ) (c) = e + (d) = e. Each of the four diagrams below shows a solution

More information

Why is Deep Learning so effective?

Why is Deep Learning so effective? Ma191b Winter 2017 Geometry of Neuroscience The unreasonable effectiveness of deep learning This lecture is based entirely on the paper: Reference: Henry W. Lin and Max Tegmark, Why does deep and cheap

More information

Quantum versus Classical Measures of Complexity on Classical Information

Quantum versus Classical Measures of Complexity on Classical Information Quantum versus Classical Measures of Complexity on Classical Information Lock Yue Chew, PhD Division of Physics and Applied Physics School of Physical & Mathematical Sciences & Complexity Institute Nanyang

More information

BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM

BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM DMITRY SHEMETOV UNIVERSITY OF CALIFORNIA AT DAVIS Department of Mathematics dshemetov@ucdavis.edu Abstract. We perform a preliminary comparative

More information

11.4 Polar Coordinates

11.4 Polar Coordinates 11. Polar Coordinates 917 11. Polar Coordinates In Section 1.1, we introduced the Cartesian coordinates of a point in the plane as a means of assigning ordered pairs of numbers to points in the plane.

More information

Copyright, 2008, R.E. Kass, E.N. Brown, and U. Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS

Copyright, 2008, R.E. Kass, E.N. Brown, and U. Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS Copright, 8, RE Kass, EN Brown, and U Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS Chapter 6 Random Vectors and Multivariate Distributions 6 Random Vectors In Section?? we etended

More information

FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures AB = BA = I,

FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures AB = BA = I, FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING Lectures MODULE 7 MATRICES II Inverse of a matri Sstems of linear equations Solution of sets of linear equations elimination methods 4

More information

Strain Transformation and Rosette Gage Theory

Strain Transformation and Rosette Gage Theory Strain Transformation and Rosette Gage Theor It is often desired to measure the full state of strain on the surface of a part, that is to measure not onl the two etensional strains, and, but also the shear

More information

Fault Detection and Diagnosis Using Information Measures

Fault Detection and Diagnosis Using Information Measures Fault Detection and Diagnosis Using Information Measures Rudolf Kulhavý Honewell Technolog Center & Institute of Information Theor and Automation Prague, Cech Republic Outline Probabilit-based inference

More information

Answer Explanations. The SAT Subject Tests. Mathematics Level 1 & 2 TO PRACTICE QUESTIONS FROM THE SAT SUBJECT TESTS STUDENT GUIDE

Answer Explanations. The SAT Subject Tests. Mathematics Level 1 & 2 TO PRACTICE QUESTIONS FROM THE SAT SUBJECT TESTS STUDENT GUIDE The SAT Subject Tests Answer Eplanations TO PRACTICE QUESTIONS FROM THE SAT SUBJECT TESTS STUDENT GUIDE Mathematics Level & Visit sat.org/stpractice to get more practice and stud tips for the Subject Test

More information

Characterization of the Skew-Normal Distribution Via Order Statistics and Record Values

Characterization of the Skew-Normal Distribution Via Order Statistics and Record Values International Journal of Statistics and Probabilit; Vol. 4, No. 1; 2015 ISSN 1927-7032 E-ISSN 1927-7040 Published b Canadian Center of Science and Education Characterization of the Sew-Normal Distribution

More information

10.5 Graphs of the Trigonometric Functions

10.5 Graphs of the Trigonometric Functions 790 Foundations of Trigonometr 0.5 Graphs of the Trigonometric Functions In this section, we return to our discussion of the circular (trigonometric functions as functions of real numbers and pick up where

More information