On complexity and randomness of Markov-chain prediction


Joel Ratsaby
Department of Electrical and Electronics Engineering, Ariel University, Ariel, ISRAEL
Email: ratsaby@ariel.ac.il

Abstract: Let $\{X_t : t \in \mathbb{Z}\}$ be a sequence of binary random variables generated by a stationary Markov source of order $k^*$. Let $\beta$ be the probability of the event $X_t = 1$. Consider a learner based on a Markov model of order $k$, where $k$ may be different from $k^*$, who trains on a sample sequence $X^{(m)}$ which is randomly drawn by the source. Test the learner's performance by asking it to predict the bits of a test sequence $X^{(n)}$ (generated by the source). An error occurs at time $t$ if the prediction $Y_t$ differs from the true bit value $X_t$, $1 \le t \le n$. Denote by $\Xi^{(n)}$ the sequence of errors, where the error bit $\xi_t$ at time $t$ equals $1$ or $0$ according to whether an error occurred or not, respectively. Consider the subsequence $\Xi^{(\nu)}$ of $\Xi^{(n)}$ which corresponds to the errors made when predicting a $0$, that is, $\Xi^{(\nu)}$ consists of those bits $\xi_t$ at times $t$ where $Y_t = 0$. In this paper we compute an upper bound on the absolute deviation between the frequency of $1$ in $\Xi^{(\nu)}$ and $\beta$. The bound has an explicit dependence on $k$, $k^*$, $m$, $\nu$, $n$. It shows that the larger $k$ is, or the larger the difference $k - k^*$ is, the less random $\Xi^{(\nu)}$ can become.

(October 2017: corrected version of the paper "On complexity and randomness of Markov chain prediction", Proceedings of the IEEE Information Theory Workshop, ITW'15.)

1 Overview

Let $\mathbb{Z}$ denote the set of all integers. Let $\{X_t : t \in \mathbb{Z}\}$ be a sequence of binary random variables possessing the following Markov property,
$$P(X_t = x \mid \ldots, X_{t-2} = x_{t-2}, X_{t-1} = x_{t-1}) = P(X_t = x \mid X_{t-k^*} = x_{t-k^*}, \ldots, X_{t-1} = x_{t-1}) \qquad (1)$$
for some fixed $k^*$, where $x_{t-k^*}, \ldots, x_{t-1}, x$ take a binary value of $0$ or $1$, for $t$ an integer. This sequence is known as a $k^*$-th order Markov chain. The Markov chain follows a Markov probability model $M_{k^*}$ which consists of a state space $S_{k^*} = \{0,1\}^{k^*}$ and a $2^{k^*} \times 2^{k^*}$ state-transition probability matrix $Q$. We denote the $i$-th state of $M_{k^*}$ by $s^{(i)}$, $i = 0, 1, \ldots, 2^{k^*}-1$, with $s^{(0)} := [s^{(0)}_{k^*-1}, \ldots, s^{(0)}_0] = [0, \ldots, 0, 0]$ (the all-zero state), $s^{(1)} := [s^{(1)}_{k^*-1}, \ldots, s^{(1)}_0] = [0, \ldots, 0, 1]$, $\ldots$, $s^{(2^{k^*}-1)} := [1, \ldots, 1]$. The state-transition matrix is defined as $Q[i,j] := p(s^{(j)} \mid s^{(i)})$, where the model's parameters are the state-transition probabilities $p(s^{(j)} \mid s^{(i)})$. Based on $M_{k^*}$, the Markov chain $\{X_t : t \in \mathbb{Z}\}$ gives rise to a random state sequence $\{S_t : t \in \mathbb{Z}\}$, where $S_t := (X_{t-k^*+1}, X_{t-k^*+2}, \ldots, X_t) \in S_{k^*}$ is the random state at time $t$. When $Q$ does not depend on time we refer to $M_{k^*}$ as a homogeneous Markov model.

The structure of $M_{k^*}$ allows for only two possible outgoing transitions from a state $S_t$ to the next state $S_{t+1}$, since $S_{t+1}$ can take only one of the two values $(X_{t-k^*+2}, \ldots, X_t, 0)$ or $(X_{t-k^*+2}, \ldots, X_t, 1)$. We call them type-$0$ and type-$1$ transitions. Associated with each state $s^{(i)}$ are two non-zero valued parameters, which are the probabilities of transition to those two states $s^{(j)}$ that are obtained by a type-$0$ or a type-$1$ transition from state $s^{(i)}$. We denote these parameters by $p_1(i)$ and $p_0(i) = 1 - p_1(i)$. Hence the set $\{p_1(i) : 0 \le i \le 2^{k^*}-1\}$ serves as the parameters of $M_{k^*}$.
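To make the source model concrete, here is a minimal Python sketch (the function names and parameter values are illustrative assumptions, not taken from the paper) that simulates such a binary Markov source of order $k^*$: the state is the last $k^*$ bits, a type-$1$ transition from state $s^{(i)}$ occurs with probability $p_1(i)$, and a burn-in period stands in for sampling the chain in its stationary regime.

```python
import numpy as np

def simulate_source(p1, kstar, length, burn_in=1000, seed=0):
    """Simulate a binary Markov chain of order kstar.

    p1[i] is the probability of a type-1 transition (emitting a 1) from the
    state whose last kstar bits encode the integer i.  A burn-in period
    approximates sampling from the stationary regime."""
    rng = np.random.default_rng(seed)
    state = int(rng.integers(0, 2 ** kstar))             # arbitrary initial state
    bits = []
    for _ in range(burn_in + length):
        b = int(rng.random() < p1[state])                 # emit 1 w.p. p1[state]
        bits.append(b)
        state = ((state << 1) | b) & (2 ** kstar - 1)     # shift the new bit into the state
    return bits[burn_in:]                                  # discard the burn-in prefix

# Example: an order-2 source with hand-picked parameters p1(i) for states 00, 01, 10, 11.
kstar, m = 2, 20
p1 = np.array([0.9, 0.3, 0.6, 0.2])
training = simulate_source(p1, kstar, m + kstar)           # m + max{k, kstar} training bits
```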
In this paper we consider the following problem.

Problem: Let $k$, $k^*$, $m$, $n$ be fixed positive integers. Let the source be a homogeneous stationary Markov chain that is associated with a Markov probability model $M_{k^*}$ of order $k^*$. From the source, we sample $m + \max\{k, k^*\}$ consecutive values and obtain a stationary finite Markov chain
$$X^{(m)} := \{X_t\}_{t=-\max\{k,k^*\}+1}^{m}, \qquad (2)$$
denoted as the training sequence. A learner (not knowing the value of $k^*$ and the model $M_{k^*}$) estimates the parameters (transition probabilities) of a Markov model $M_k$ of order $k$ based on $X^{(m)}$, where as the initial state he uses $S_0 = (X_{-k+1}, \ldots, X_0)$, and $X_t$ is the $t$-th bit of $X^{(m)}$. After training, the testing stage begins and the above sampling is repeated in order to obtain an $(n + \max\{k, k^*\})$-long stationary Markov chain, denoted as the testing sequence, $X^{(n)} := \{X_t\}_{t=-\max\{k,k^*\}+1}^{n}$. The learner uses $M_k$ to predict $X^{(n)}$ by starting at the initial state $S_0 := (X_{-k+1}, \ldots, X_0)$, where $X_t$ is the $t$-th bit of $X^{(n)}$, and then using $M_k$ to produce a sequence of predictions $Y^{(n)} := \{Y_t\}_{t=1}^{n}$, where $Y_t$ represents the prediction of bit $X_t$ in $X^{(n)}$, $1 \le t \le n$. The prediction $Y_t$ may be $0$, $1$ or a third value which represents no-decision. Denote by $\Xi^{(n)} := \{\xi_t\}_{t=1}^{n}$ the corresponding binary sequence of mistakes, where $\xi_t = 1$ if $Y_t \ne X_t$ (and $Y_t$ differs from a no-decision), otherwise $\xi_t = 0$. Denote by $\Xi^{(\nu)} = \{\xi_{t_l}\}_{l=1}^{\nu}$ a subsequence of $\Xi^{(n)}$ with time instants corresponding to $0$-predictions, $Y_{t_l} = 0$, for $1 \le l \le \nu \le n$. Note that $\Xi^{(\nu)}$ is also a subsequence of the input sequence $X^{(n)}$, hence effectively the learner acts as a selection rule $R_d$ which picks certain bits $\Xi^{(\nu)}$ from $X^{(n)}$ according to its prediction rule $d$. In this paper we determine the relationship between the main variables $k$, $k^*$, $m$, $n$ of the learning process and the stochasticity property (frequency stability) of the error subsequence $\Xi^{(\nu)}$. One direct consequence of our result shows that the larger the learner's complexity (the model's order $k$), the higher this deviation can be.
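The selection-rule view of the learner can be pictured with a short sketch: given a prediction ($0$, $1$, or a no-decision value) for every test bit, it builds the mistake sequence $\Xi^{(n)}$, extracts the subsequence at the $0$-prediction instants, and measures how far its frequency of $1$s is from $\beta$. The code is an illustrative assumption (the REJECT sentinel and the function name are not the paper's notation).

```python
REJECT = "reject"   # stand-in for the learner's no-decision value

def error_subsequence(test_bits, predictions, beta):
    """Mistake sequence, the bits selected at 0-predictions, and the absolute
    deviation between their frequency of 1s and beta (None if nothing selected)."""
    xi = [int(y != REJECT and y != x) for x, y in zip(test_bits, predictions)]
    selected = [x for x, y in zip(test_bits, predictions) if y == 0]
    if not selected:
        return xi, selected, None
    deviation = abs(sum(selected) / len(selected) - beta)
    return xi, selected, deviation

# Tiny hand-made example.
test_bits   = [0, 1, 1, 0, 1, 0, 0, 1]
predictions = [0, 0, REJECT, 0, 1, 0, 1, REJECT]
print(error_subsequence(test_bits, predictions, beta=0.5))
```

At a $0$-prediction instant the selected test bit coincides with the mistake bit, which is why the selected bits serve directly as $\Xi^{(\nu)}$.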

2 Introduction

The concept of randomness of a binary sequence conveys that a sequence is more unpredictable if it contains fewer patterns or regularities. To test whether an infinite binary sequence is random, the approach of [4] defines an admissible selection rule as any (partial) computable function which reads in bits of the sequence, picks a bit that has not been read yet, decides whether it should be selected and, if so, reads its value. When the subsequences that are selected by such a rule satisfy the frequency stability property they are called KL-stochastic. It is known that the complexity of a selection rule influences the stochasticity of the selected subsequence. In [1] it is shown that the higher the algorithmic complexity of the selection rule, the less the extent to which the selected subsequence satisfies the frequency stability property.

In this paper we are not interested in the analysis of the accuracy of learning but rather in the randomness property of prediction mistakes. We analyze the effect that the complexity of a learner has on the randomness of its prediction mistakes. The learner receives as input the sequence $X^{(n)}$ and produces an output sequence $\Xi^{(\nu)}$ which, taking into account only the mistakes that correspond to the prediction of zeros of $X^{(n)}$, means that $\Xi^{(\nu)}$ is a subsequence of $X^{(n)}$. In this respect, the learner acts as an entity that selects bits from $X^{(n)}$ and in so doing has an impact on the level of randomness of the input $X^{(n)}$. Our aim is to quantify this phenomenon in terms of the frequency instability of $\Xi^{(\nu)}$.

3 Setup, definitions and notations

Let $k$, $k^*$, $\ell$, $m$, $n$ be given positive integers. We use capital letters $X, S, Y, \ldots$ to represent random variables and lower case letters $x, s, y, \ldots$ to represent their values. To distinguish the learner's states from the states of the source's model, we place a star in the superscript of the symbol that represents a source's state. That is, $S^*_t = s^{*(i)}$ means that the random state at time $t$ is the source's state $s^{*(i)}$, where $s^{*(i)} \in S_{k^*}$. As mentioned above, the source becomes stationary before producing either $X^{(m)}$ or $X^{(n)}$. Let $I(E)$ denote the indicator function for the logical (Boolean) expression $E$. Let $\pi$ be the stationary joint probability distribution of the Markov source, and re-denote by $Q$ the $2^{k^*} \times 2^{k^*}$ state-transition probability matrix. We denote by $\beta := P(X_t = 1)$ the stationary probability of the event $\{X_t = 1\}$. In case $k^* > k$ we denote by $\pi(j, i)$ the stationary probability of the source's model being at state $s^{*(j,i)}$, where $s^{*(j,i)} \in S_{k^*}$ is a binary number whose $k^* - k$ leftmost digits amount to a binary number whose decimal value is $j$, $0 \le j \le 2^{k^*-k}-1$, and whose $k$ rightmost bits amount to a binary number whose decimal value is $i$, $0 \le i \le 2^k - 1$. For $0 \le i \le 2^k - 1$ we denote by $\pi(j \mid i) := \pi(j, i) / \sum_{j'=0}^{2^{k^*-k}-1} \pi(j', i)$ the probability that the $k^* - k$ leftmost bits amount to the decimal value $j$ given that the $k$ rightmost bits amount to $i$. For $1 \le q \le r$, define the projection operator $\langle \cdot \rangle_q : S_r \to S_q$ as a mapping that takes a state $s^{(i)} \in S_r$ to a state $s = \langle s^{(i)} \rangle_q = [s^{(i)}_{q-1}, \ldots, s^{(i)}_0] \in S_q$ whose $q$ digits correspond to the $q$ least significant (rightmost) bits of $s^{(i)}$.

Looking at the training data $X^{(m)}$, the learner interprets the state of the chain according to a Markov probability model $M_k$ of order $k$ which, in general, is different from $k^*$. According to the learner's model $M_k$, the probability that the chain makes a type-$1$ transition from state $s^{(i)}$ is $p_1(i)$, where $s^{(i)} \in S_k$. If $k \ge k^*$ then $p_1(i) = p(1 \mid \langle s^{(i)} \rangle_{k^*})$, which is a parameter of the source's model $M_{k^*}$. If $k < k^*$ then
$$p_1(i) = \sum_{j=0}^{2^{k^*-k}-1} p\left(1 \mid S^* = s^{*(j,i)}\right)\, \pi(j \mid i), \qquad (3)$$
where the state $s^{*(j,i)}$ is defined above. In either case, $p_1(i)$ is determined directly from $M_{k^*}$, since it is determined by the stationary probability distribution $\pi$ and the transition probabilities $p(1 \mid \langle s^{(i)} \rangle_{k^*})$ or $p(1 \mid S^* = s^{*(j,i)})$ of the source's model $M_{k^*}$. Thus the true values of the learner's model parameters $p_1(i)$, which are unknown to the learner, are completely determined by the source's model $M_{k^*}$.
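As a concrete illustration of how the learner's true parameters are induced by the source, the following sketch computes $p_1(i)$ for both cases $k \ge k^*$ and $k < k^*$ along the lines of the relations above and (3); the helper names and the eigenvector method for the stationary distribution are implementation assumptions, not part of the paper.

```python
import numpy as np

def stationary_distribution(Q):
    """Stationary distribution of a row-stochastic matrix Q (left Perron eigenvector)."""
    evals, evecs = np.linalg.eig(Q.T)
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return v / v.sum()

def learner_parameters(p1_star, kstar, k):
    """True values p1(i) of the learner's order-k model induced by an order-kstar
    source whose type-1 transition probabilities are p1_star (indexed by state)."""
    p1_star = np.asarray(p1_star, dtype=float)
    if k >= kstar:
        # p1(i) = p(1 | <s(i)>_kstar): only the kstar rightmost bits of i matter
        return np.array([p1_star[i & (2 ** kstar - 1)] for i in range(2 ** k)])
    # k < kstar: marginalize the source's stationary distribution, as in (3)
    S = 2 ** kstar
    Q = np.zeros((S, S))
    for j in range(S):
        Q[j, ((j << 1) | 0) & (S - 1)] = 1.0 - p1_star[j]   # type-0 transition
        Q[j, ((j << 1) | 1) & (S - 1)] = p1_star[j]          # type-1 transition
    pi = stationary_distribution(Q)
    p1 = np.zeros(2 ** k)
    for i in range(2 ** k):
        idx = [j for j in range(S) if (j & (2 ** k - 1)) == i]  # source states ending in i
        w = pi[idx]
        p1[i] = float(np.dot(w / w.sum(), p1_star[idx]))        # weighted by pi(j | i)
    return p1
```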
Based on the training sequence (2), let
$$S^{*(m)} = \{S^*_t\}_{t=1}^{m}, \qquad S^*_t = [X_{t-k^*+1}, \ldots, X_t], \quad 1 \le t \le m, \qquad (4)$$
and
$$S^{(m)} = \{S_t\}_{t=1}^{m}, \qquad S_t = [X_{t-k+1}, \ldots, X_t], \quad 1 \le t \le m.$$
Note that there is a one-to-one correspondence between these two sequences, because given one of the two sequences we can obtain the uniquely corresponding sequence $X^{(m)}$ from which the other sequence is obtained. Let $\hat{\pi}_i$ denote the proportion (number of times) that state $s^{(i)} \in S_k$ appears in the state sequence $S^{(m)}$. We denote by $\hat{\pi} := \hat{\pi}(S^{(m)}) = [\hat{\pi}_0(S^{(m)}), \ldots, \hat{\pi}_{2^k-1}(S^{(m)})]$, where $\hat{\pi}$ satisfies $\sum_{i=0}^{2^k-1} \hat{\pi}_i = 1$. For brevity we also write $\hat{\pi}_i := \hat{\pi}_i(X^{(m)}) = \hat{\pi}_i(S^{(m)})$. For instance, with $k = 3$ and $\max\{k, k^*\} = 3$, the state at time $t$ is $S_t = [X_{t-2}, X_{t-1}, X_t]$, and $\hat{\pi}_i$ is the fraction of the $m$ time instants $t = 1, \ldots, m$ at which $S_t$ equals $s^{(i)}$. We denote by $\pi_i = E[\hat{\pi}_i]$ the stationary probability that $S_t = s^{(i)}$, for $1 \le t \le m$. Let
$$\hat{p}_1(i) := \frac{1}{m\,\hat{\pi}_i} \sum_{l:\; S_{t_l} = s^{(i)}} I\{X_{t_l+1} = 1\} \qquad (5)$$
be the frequency of type-$1$ transitions from state $s^{(i)}$ in the sequence $S^{(m)}$. The index in the sum runs over those time instants $t_l \in \{0, 1, \ldots, m-1\}$, $1 \le l \le m$, where $S_{t_l} = s^{(i)}$.

4 Learning a decision rule

The learner uses the training sequence to estimate the parameters $p_1(i)$ by $\hat{p}_1(i)$, $0 \le i \le 2^k - 1$.
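The estimates $\hat{\pi}_i$ and $\hat{p}_1(i)$ of (5) amount to counting state visits and the bits that follow them; a small illustrative sketch (the names are assumptions) is:

```python
import numpy as np

def estimate(training_bits, k, m):
    """Empirical state proportions pi_hat and type-1 transition frequencies p1_hat,
    counted over m successive time instants.  training_bits holds m + k bits; the
    state at an instant is the k bits ending there, and the transition is to the
    bit that immediately follows."""
    visits = np.zeros(2 ** k, dtype=int)
    ones = np.zeros(2 ** k, dtype=int)
    for t in range(m):
        state = 0
        for b in training_bits[t:t + k]:
            state = (state << 1) | int(b)
        visits[state] += 1
        ones[state] += int(training_bits[t + k])
    pi_hat = visits / m
    p1_hat = np.where(visits > 0, ones / np.maximum(visits, 1), np.nan)  # NaN for unvisited states
    return pi_hat, p1_hat
```

For example, `estimate(training, k=2, m=20)` applied to the illustrative training sequence from the earlier sketch returns the proportions and transition frequencies for the four order-2 states.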

Based on these random parameter estimates, the learner defines a decision rule that has the option of choosing a REJECT action. This option is a standard way to minimize decision errors by refraining from making a decision with a small confidence (it had been in use at least as early as [2]). Define the set $\mathcal{D} := \{0, 1, \mathrm{REJECT}\}$. The decision rule of the learner is denoted by $\hat{d}$ and is defined as
$$\hat{d} := \hat{d}(X^{(m)}) = [\hat{d}(0), \ldots, \hat{d}(2^k - 1)], \qquad (6)$$
where, if $\hat{\pi}_i = 0$, $X^{(m)}$ did not visit state $s^{(i)}$ so we let $\hat{d}(i) = \mathrm{REJECT}$, and if $\hat{\pi}_i > 0$ then
$$\hat{d}(i) := \hat{d}(i, \hat{\pi}_i) = \begin{cases} 1 & \text{if } \hat{p}_1(i) > \frac{1}{2} + T(\delta, \hat{\pi}_i) \\ 0 & \text{if } \hat{p}_1(i) < \frac{1}{2} - T(\delta, \hat{\pi}_i) \\ \mathrm{REJECT} & \text{otherwise} \end{cases} \qquad (7)$$
for $0 \le i \le 2^k - 1$, where $T$ defines a threshold that determines whether a non-reject decision is to be made. The value of $T$ depends on a confidence parameter $\delta \in (0, 1]$ chosen by the learner and on the state proportion $\hat{\pi}_i \in (0, 1]$. We henceforth write $\hat{d}$ without explicitly showing the dependence on $\delta$. Given a fixed $\delta$ and a training sequence $x^{(m)}$, the learner measures $\hat{\pi}_i(x^{(m)})$ and then evaluates the threshold $T(\delta, \hat{\pi}_i)$ for each state $s^{(i)}$, $0 \le i \le 2^k - 1$. With these threshold values, the learner has a decision rule $\hat{d}$ which is defined according to (7). Note that $\hat{d}$ completely describes the decision made by the learner at any state of the model $M_k$. The function $T$ decreases with increasing $\hat{\pi}_i$ and increasing $\delta$. It is defined such that if $\hat{d}(i) = 1$ then with probability at most $\delta$ the Bayes optimal decision is $d^*(i) = 0$, or else if $\hat{d}(i) = 0$ then with probability at most $\delta$ the Bayes decision is $d^*(i) = 1$, $0 \le i \le 2^k - 1$. That is, with confidence at least $1 - \delta$ the learner's decision $\hat{d}$ agrees with the Bayes optimal decision $d^*$. We note that in Theorem 1 the choice for $T$ is dictated by the choice of a confidence parameter $\delta$.

5 Testing prediction

With the decision rule $\hat{d}$ in place, the learner is tested on the $n$-bit sequence $X^{(n)}$. For $1 \le t \le n$ the learner predicts a bit $Y_t$ for bit $X_t$ of $X^{(n)}$. The value of $Y_t$ is determined by the learner's decision rule $\hat{d}$: $Y_t = 1$ if the state $S_t$ at time $t$ equals $s^{(i)}$ and $\hat{p}_1(i) > \frac{1}{2} + T(\delta, \hat{\pi}_i)$; $Y_t = 0$ if the state is $s^{(i)}$ and $\hat{p}_1(i) < \frac{1}{2} - T(\delta, \hat{\pi}_i)$; otherwise $Y_t = \mathrm{REJECT}$. The sequence of prediction mistakes $\Xi^{(n)}$ is defined as follows: for $1 \le t \le n$, $\xi_t = 0$ if $Y_t = X_t$ or $Y_t = \mathrm{REJECT}$, otherwise $\xi_t = 1$. Let $\Xi^{(\nu)} = \{\xi_{t_l}\}_{l=1}^{\nu}$ be the subsequence of prediction mistakes whose time instants correspond to $0$-predictions, $Y_{t_l} = 0$, $1 \le l \le \nu$. We are interested in the event that this subsequence has a length at least $\ell$ and that the frequency of $1$ in $\Xi^{(\nu)}$ deviates (in absolute value) from $\beta$ by at least $\Delta$. Note that $\Xi^{(\nu)}$ is a subsequence of the input sequence $X^{(n)}$, and hence the learner effectively acts as a selection rule which picks out a random subsequence of bits $\Xi^{(\nu)}$ from $X^{(n)}$. We now state the main result of the paper.

6 Result

In the full paper [5] we show that there exists a finite integer $l_0 \ge 1$ such that for $l \ge l_0$ the transition matrix $Q$ satisfies $Q^l > 0$, that is, every entry of $Q^l$, denoted by $p^{(l)}(s^{*(j)} \mid s^{*(i)})$, is positive. This $l_0$ can be evaluated, given $Q$, just by computing $Q^l$ for a sequence of $l$ until the first $l$ is found such that $Q^l > 0$. Denote by $\mu_0 := \min_{i,j}\, p^{(l_0)}(s^{*(j)} \mid s^{*(i)})$. We use $l_0$ and $\mu_0$ in the following definitions. Let
$$\kappa(k, l_0) := \frac{2^{k}\,\left(1 - 2^{k^*}\mu_0\right)^{(l_0-1)/l_0}}{1 - \left(1 - 2^{k^*}\mu_0\right)^{1/l_0}}$$
and let
$$r(k, k^*) := \begin{cases} 1 & \text{if } k \ge k^* + 1 \\ k^* - k + 1 & \text{if } k \le k^*. \end{cases}$$
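Finding $l_0$ and $\mu_0$ is exactly the mechanical search described above (power $Q$ until every entry is positive); a minimal sketch, with an assumed safety cap `max_l` so that the loop always terminates, is:

```python
import numpy as np

def l0_and_mu0(Q, max_l=1000):
    """Smallest l0 with Q**l0 entrywise positive, and mu0, the minimum entry of
    Q**l0.  Returns (None, None) if no such l0 <= max_l is found."""
    P = np.array(Q, dtype=float)
    for l in range(1, max_l + 1):
        if np.all(P > 0):
            return l, float(P.min())
        P = P @ Q
    return None, None
```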
Theorem 1. Let $m$, $n$, $k$, $k^*$ be positive integers. Let $M_k$ be a Markov model of order $k$ used by the learner. Let $X^{(m)} = \{X_t\}_{t=-\max\{k,k^*\}+1}^{m}$ and $X^{(n)} = \{X_t\}_{t=-\max\{k,k^*\}+1}^{n}$ be two stationary binary Markov chains of order $k^*$ produced by the same source. Let $\beta$ be the stationary probability $P(X_t = 1)$. Denote by $\hat{\pi}_i := \hat{\pi}_i(X^{(m)})$, $0 \le i \le 2^k - 1$, the random variables representing the proportion (number of times) that state $s^{(i)}$ of $M_k$ appears in the state sequence $S^{(m)}$ that corresponds to $X^{(m)}$. For any $0 < \delta \le 1$ and $\ell \in \{1, \ldots, n\}$ with $\ell/n \in (0, 1]$, let the learner use a decision rule $\hat{d}(X^{(m)})$ defined according to (7), with the following choice for $T$:
$$T(\delta, \hat{\pi}_i) := \sqrt{\frac{2^k\, 3^{k+4}\, \kappa(k, l_0)}{\hat{\pi}_i\, m}\, \ln\frac{2m}{\delta}}. \qquad (8)$$
Then, with probability no more than $\delta$, the subsequence $\Xi^{(\nu)}$ of mistakes corresponding to the time instants at which the learner predicts $Y_{t_l} = 0$ for bit $X_{t_l}$ of the sequence $X^{(n)}$, $t_l \in \{1, \ldots, n\}$, satisfies both of the following conditions:

1) the absolute value of the deviation between $\beta$ and the frequency of $1$ in $\Xi^{(\nu)}$ is more than
$$\Delta := 4\, r(k, k^*)\, \kappa(k, l_0)\, \sqrt{\frac{(2^k + 4)\ln 2 + \ln\frac{2}{\delta}}{n}}; \qquad (9)$$

2) $\Xi^{(\nu)}$ has a length $\nu \ge \ell$. \hfill (10)

The theorem implies that, with probability at least $1 - \delta$, either the absolute deviation between $\beta$ and the frequency of $1$ in $\Xi^{(\nu)}$ is no more than (9), or $\nu < \ell$, where in the latter case there is no guarantee about the size of the deviation. In the former case, if we take the upper bound (9) on the deviation to be a measure of deficiency of randomness of the error subsequence, and view $k$ as the complexity of the learner's model $M_k$, then it follows that the larger the complexity $k$, the less random the error subsequence can become.

The larger the difference $k - k^*$ between the learner model's order and the source model's order, the less random the subsequence may become. One extreme case is when $\delta$ is a very small positive number, making $T(\delta, \hat{\pi}_i) > 1/2$ for all $0 \le i \le 2^k - 1$. In this case $\hat{d}$ always decides REJECT and $\nu = 0$, so for any $1 \le \ell \le n$ and $\ell/n \in (0, 1]$ the second condition in the theorem is false and the probability of the event stated by the theorem is zero, hence no more than $\delta$. The proof of the theorem is available in [5]. Due to the limitation in space, we provide here only a sketch of the proof.

7 Sketch of the proof

Considering all possible decision rules based on Markov models $M_k$ of order $k$, it is well known that the Bayes decision rule yields the minimal expected number of prediction mistakes and is defined by $d^*$, where the decision at state $s^{(i)}$ is $d^*(i) = 1$ if the true unknown probability $p_1(i) > \frac{1}{2}$, and $d^*(i) = 0$ otherwise, $0 \le i \le 2^k - 1$. The decision rule $\hat{d}$ in (6) is obtained based on the random sequence $X^{(m)}$ and is hence a random variable itself that may differ from the optimal Bayes rule. We need to bound the absolute deviation between $\beta$ and the frequency of $1$ in the random subsequence $\Xi^{(\nu)}$ of mistakes made by this random rule $\hat{d}$. We define the relation $\asymp$ on the set $\mathcal{D}^{2^k} \times \mathcal{D}^{2^k}$ as follows: for $d, d' \in \mathcal{D}^{2^k}$, $d \asymp d'$ if for all $0 \le i \le 2^k - 1$ either $d(i) = d'(i)$ or one of the values $d(i)$, $d'(i)$ is equal to REJECT. That is,
$$d \asymp d' \iff \forall\, 0 \le i \le 2^k - 1:\ \big(d(i) = d'(i)\big) \,\lor\, \big(d'(i) = \mathrm{REJECT}\big) \,\lor\, \big(d(i) = \mathrm{REJECT}\big).$$
When $d \asymp d'$ we say that $d$ and $d'$ are equivalent up to rejects.
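For concreteness, the Bayes rule and the relation just defined can be written as two illustrative helpers (an assumed sketch, not the paper's notation):

```python
REJECT = "reject"

def bayes_rule(p1):
    """Bayes optimal rule d*(i): predict 1 iff the true p1(i) exceeds 1/2."""
    return [1 if p > 0.5 else 0 for p in p1]

def equivalent_up_to_rejects(d, d_prime):
    """d and d' agree at every state, except possibly where one of them rejects."""
    return all(a == b or REJECT in (a, b) for a, b in zip(d, d_prime))
```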
The Bayes optimal rule $d^*$ has no rejects, that is, $d^*(i) \in \{0, 1\}$; thus the event $\hat{d} \not\asymp d^*$ means that there exists a state $s^{(i)}$ such that $\hat{d}(i) \ne d^*(i)$ and $\hat{d}(i) \ne \mathrm{REJECT}$ and $\hat{\pi}_i > 0$ (recall, if $\hat{\pi}_i = 0$ then $\hat{d}(i) = \mathrm{REJECT}$). In obtaining a bound on the probability that the absolute deviation is larger than $\Delta$, our approach is to consider the two cases, $\hat{d} \not\asymp d^*$ and $\hat{d} \asymp d^*$. In the former case we bound the probability that $\hat{d}$ strictly differs from $d^*$, and in the latter case we bound the probability that the deviation is larger than $\Delta$ for the subsequence $\Xi^{(\nu)}$ of mistakes made by a rule which is equivalent to the Bayes optimal rule $d^*$ up to rejects.

We start with bounding the probability of the event $\hat{d} \not\asymp d^*$. Consider a state $s^{(i)}$ of $M_k$ and define the indicator random variable
$$v^{(i)}_l = \begin{cases} 1 & \text{if a type-}1\text{ transition occurs at state } S_{t_l} = s^{(i)} \\ 0 & \text{otherwise.} \end{cases}$$
We consider two cases: a) $k \ge k^*$, b) $k < k^*$. In case a), the probability that $v^{(i)}_l = 1$ only depends on the state $s^{(i)}$ at time $t_l$, that is, on $S_{t_l} = s^{(i)}$. Therefore from (1) we have
$$P\left(v^{(i)}_l = 1\right) = p\left(1 \mid \langle s^{(i)} \rangle_{k^*}\right). \qquad (11)$$
Since the probability $p(1 \mid \langle s^{(i)} \rangle_{k^*})$ is a parameter value of the model $M_{k^*}$, it is constant with respect to time. In case b), from (3) we have
$$P\left(v^{(i)}_l = 1\right) = \sum_{j=0}^{2^{k^*-k}-1} p\left(1 \mid S^*_{t_l} = s^{*(j,i)}\right)\, \pi(j \mid i), \qquad (12)$$
where both of the probabilities in the sum of (12) are completely determined by the source's model $M_{k^*}$. We henceforth denote by $p_1(i) := P(v^{(i)}_l = 1)$, remembering that it is either (11) or (12) depending on whether case a) or b) holds, respectively. It is easy to show that $p_1(i) = E[\hat{p}_1(i)]$ in both cases, where expectation is taken with respect to the stationary probability distribution $\pi$.

The event $\hat{d} \not\asymp d^*$ implies that there exists an $i \in \{0, 1, \ldots, 2^k - 1\}$ with $\hat{\pi}_i > 0$ such that $\hat{d}(i) \ne d^*(i)$ and $\hat{d}(i) \ne \mathrm{REJECT}$. This implies that $|\hat{p}_1(i) - p_1(i)| > T(\delta, \hat{\pi}_i)$, $\hat{\pi}_i > 0$. (Here $\hat{p}_1(i)$ depends on $\hat{\pi}_i$, since it is the number of type-$1$ transitions from state $s^{(i)}$ in the sequence $s^{(m)}$ divided by $\hat{\pi}_i m$; see (5).) Let $I := \{\frac{1}{m}, \frac{2}{m}, \ldots, 1\}$ and let
$$A_{d^*} := \left\{ x^{(m)} \in \{0,1\}^m : \exists\, 0 \le i \le 2^k - 1,\ \hat{d}(i) \ne \mathrm{REJECT},\ \hat{\pi}_i(x^{(m)}) > 0,\ \hat{d}(i) \ne d^*(i) \right\}.$$
The event $\hat{d} \not\asymp d^*$ is equivalent to the set $A_{d^*}$. We have
$$A_{d^*} \subseteq \left\{ x^{(m)} \in \{0,1\}^m : \exists\, 0 \le i \le 2^k - 1,\ \exists\, \hat{\pi}_i \in I,\ \left|\hat{p}_1(i) - p_1(i)\right| > T(\delta, \hat{\pi}_i) \right\}.$$
Let $0 < \delta \le 1$ and choose $\alpha$ to be a function $\alpha(\delta, \hat{\pi}_i)$ such that
$$P\left( \exists\, \hat{\pi}_i \in I,\ \left|\hat{p}_1(i) - p_1(i)\right| > T\big(\alpha(\delta, \hat{\pi}_i), \hat{\pi}_i\big) \right) \le \alpha. \qquad (13)$$
For any $0 < \delta \le 1$, if we let $\alpha = \delta/2^{k+1}$ then we obtain
$$P(A_{d^*}) \le \sum_{i=0}^{2^k-1} P\left( \exists\, \hat{\pi}_i \in I,\ \left|\hat{p}_1(i) - p_1(i)\right| > T\big(\alpha(\delta, \hat{\pi}_i), \hat{\pi}_i\big) \right) \le \sum_{i=0}^{2^k-1} \alpha = 2^k \alpha = \frac{\delta}{2}. \qquad (14)$$
We now derive the functions $T(\delta, \hat{\pi}_i)$ and $\alpha(\delta, \hat{\pi}_i)$ for cases a) and b), such that (14) holds. In case a), although the sequence is a Markov chain, the indicator random variables $v^{(i)}_l$ are i.i.d. Bernoulli with success probability $p_1(i) = p(1 \mid \langle s^{(i)} \rangle_{k^*})$. Using Chernoff's bound for the average of a sequence of i.i.d. Bernoulli random variables, we obtain that the choices
$$T\big(\alpha(\delta, \hat{\pi}_i), \hat{\pi}_i\big) = T^{(a)}(\alpha, \hat{\pi}_i) := \sqrt{\frac{1}{2\hat{\pi}_i m}\,\ln\frac{2m}{\alpha}}, \qquad \alpha(\delta, \hat{\pi}_i) = \frac{\delta}{2^{k+1}}$$
ensure that the bound in (14) holds.
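For orientation, the generic Chernoff–Hoeffding inequality for the average of $N$ i.i.d. Bernoulli variables already shows where a threshold of this $\sqrt{\ln(\cdot)/(\hat{\pi}_i m)}$ type comes from; the constants below are those of the generic bound, not necessarily the exact ones used in [5].

```latex
% Chernoff--Hoeffding bound for the mean of N i.i.d. Bernoulli(p) variables,
% and the threshold obtained by equating the right-hand side to a target alpha:
\[
  P\bigl(\lvert \hat{p} - p \rvert > T\bigr) \le 2e^{-2NT^{2}},
  \qquad
  2e^{-2NT^{2}} = \alpha
  \;\Longrightarrow\;
  T = \sqrt{\tfrac{1}{2N}\ln\tfrac{2}{\alpha}},
  \qquad N = \hat{\pi}_i\, m .
\]
```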

In case b), the indicator random variables $v^{(i)}_l$ are dependent, so we use a different bound for the convergence of $\hat{p}_1(i)$ to $p_1(i)$. For any fixed $i \in \{0, 1, \ldots, 2^k - 1\}$ we have, by the triangle inequality,
$$\left|\hat{p}_1(i) - p_1(i)\right| \le \left|\hat{p}_1(i) - E[\hat{p}_1(i) \mid \hat{\pi}_i]\right| + \left|E[\hat{p}_1(i) \mid \hat{\pi}_i] - p_1(i)\right|.$$
Hence
$$P\left(\exists\, \hat{\pi}_i \in I,\ \left|\hat{p}_1(i) - p_1(i)\right| > T(\alpha, \hat{\pi}_i)\right) \qquad (15)$$
$$\le\; P\left(\exists\, \hat{\pi}_i \in I,\ \left|\hat{p}_1(i) - E[\hat{p}_1(i) \mid \hat{\pi}_i]\right| > \tfrac{1}{2}T(\alpha, \hat{\pi}_i)\right) \qquad (16)$$
$$+\; P\left(\exists\, \hat{\pi}_i \in I,\ \left|E[\hat{p}_1(i) \mid \hat{\pi}_i] - p_1(i)\right| > \tfrac{1}{2}T(\alpha, \hat{\pi}_i)\right). \qquad (17)$$
Let us fix $i \in \{0, 1, \ldots, 2^k - 1\}$. For a state $s \in S_{k^*}$ and $0 \le r < q \le k^* - 1$, we denote by $\langle s \rangle^{q}_{r}$ the binary vector $[s_q, \ldots, s_r] \in \{0,1\}^{q-r+1}$. Define $f : S_{k^*} \to \{0, 1\}$ as follows: for a state $s \in S_{k^*}$ and $s^{(i)} \in S_k$, let $f(s) = 0$ if $\langle s \rangle^{k}_{1} \ne s^{(i)}$, and $f(s) = \langle s \rangle_0$ if $\langle s \rangle^{k}_{1} = s^{(i)}$. For a sequence $s^{(m)} = \{s_t\}_{t=1}^{m}$, $s_t \in S_{k^*}$, $1 \le t \le m$, define $F(s^{(m)}) := \sum_{t=1}^{m} f(s_t)$. Therefore $\hat{p}_1(i)$ can be represented as $\hat{p}_1(i) = F(S^{*(m)}) / (m\, \hat{\pi}_i(X^{(m)}))$, where $S^{*(m)}$ is defined in (4). It can be shown that the function $F$ is Lipschitz with respect to the Hamming metric $d_H : S^m_{k^*} \times S^m_{k^*} \to [0, \infty)$ with a constant $c = 1$. Using a concentration result for Markov chains (a theorem of [3]), and substituting $\tfrac{1}{2}T$ for $T$ in (16), a threshold value $T^{(b)}(\alpha, \hat{\pi}_i)$ proportional to $\sqrt{(1/m)\ln(2/\alpha)}$, together with a suitable choice of $\alpha(\delta, \hat{\pi}_i)$, yields a bound of $\alpha/2$ on (16). In a similar way we obtain a bound of $\alpha/2$ on (17). It follows that (15) is bounded from above by $\alpha$, as required for (13) to hold.

We now bound the probability that the absolute deviation is larger than $\Delta$ when the rule $\hat{d}$ is equivalent to the Bayes optimal rule $d^*$ up to rejects. We look for a prediction error only when the learner predicts $Y_t = 0$. Let $\nu$ be the number of times that the learner predicts $Y_t = 0$. Since the error subsequence $\Xi^{(\nu)}$ is a subsequence of $X^{(n)}$, we can associate a selection rule $R_{\hat{d}} : \{0, 1\}^n \to \{0, 1\}^{\nu}$ based on $\hat{d}$ which selects $\Xi^{(\nu)}$ from $X^{(n)}$. Define $\eta_0 := \ell/n$. For every $\Delta > 0$, $1 \le \ell \le n$ and $d \in \mathcal{D}^{2^k}$ we denote the random event
$$E_d := \left\{ \exists\, \eta \in [\eta_0, 1]:\ \Xi^{(\nu)} = R_d(X^{(n)}),\ \nu = \eta n,\ \left| \frac{1}{\nu}\sum_{l=1}^{\nu} \xi_{t_l} - \beta \right| > \Delta \right\}.$$
We wish to bound the probability of the event $E_{\hat{d}}$ where $\hat{d} \asymp d^*$. The sequence $\Xi^{(\nu)}$ consists of dependent random variables. Consider $\hat{d}$ to be any fixed rule $d$ such that $d \asymp d^*$. The sum $\sum_{l=1}^{\nu} \xi_{t_l}$ can be represented as $F_d(s^{(n)}) := \sum_{t=1}^{n} f_d(\sigma_t)$, where $\sigma_t = (s_{t-r(k,k^*)+1}, \ldots, s_{t-1}, s_t)$ and $f_d(\sigma_t)$ equals the least significant bit of $\sigma_t$ if the prediction $Y_t$ (which is based on the state $\langle \sigma_t \rangle_k$) equals $0$ (see (7)), and is zero otherwise. The function $F_d$ is Lipschitz with constant $r(k, k^*)$ with respect to the Hamming metric over the space $S^n_{k^*}$, thus we employ the same concentration result as above. It yields a deviation term $\varepsilon(\delta, n)$ of order $r(k, k^*)\,\kappa(k, l_0)\sqrt{(1/n)\ln(2/\delta)}$ such that
$$P\left( \left| F_d(S^{(n)}) - E\,F_d(S^{(n)}) \right| > n\,\varepsilon(\delta, n) \right) \le \delta. \qquad (18)$$
For any fixed decision rule $d$ and any $\eta \in [\eta_0, 1]$, applying (18) with the choice of $\Delta$ matched to $\varepsilon$, and summing over the at most $n$ possible values of $\eta = \nu/n$ in $[\eta_0, 1]$, bounds the probability that the absolute deviation exceeds $\Delta$ while $\nu \ge \eta_0 n$.

We have
$$P(E_{\hat{d}}) = P\left(E_{\hat{d}},\ \hat{d} \not\asymp d^*\right) + P\left(E_{\hat{d}},\ \hat{d} \asymp d^*\right),$$
which is bounded from above by
$$P\left(\hat{d} \not\asymp d^*\right) + P\left( \bigcup_{d\,:\,d \asymp d^*} E_d \right). \qquad (19)$$
The first term is bounded from above by $\delta/2$, provided that the learner's threshold $T$ for state $s^{(i)}$ equals $T^{(a)}$ or $T^{(b)}$ (with the corresponding choice of $\alpha$), subject to whether case a) or b) holds, respectively. Choose the per-rule confidence to be $\delta/2^{2^k+1}$; then we have
$$P\left( \bigcup_{d\,:\,d \asymp d^*} E_d \right) \le \sum_{d\,:\,d \asymp d^*} P(E_d) \le 2^{2^k} \cdot \frac{\delta}{2^{2^k+1}} = \frac{\delta}{2}, \qquad (20)$$
which follows from the fact that there are $2^{2^k}$ possible decision rules $d$ that satisfy $d \asymp d^*$, since for every $0 \le i \le 2^k - 1$ there are two choices: REJECT, or not REJECT (in which case $d(i) = d^*(i)$). Thus (14) and (20) imply that (19) is bounded from above by $\delta$.

References

[1] A. E. Asarin. On some properties of finite objects random in an algorithmic sense. Soviet Mathematics Doklady, 36:109-112, 1988.
[2] C. K. Chow. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, EC-6(4):247-254, 1957.
[3] A. Kontorovich and R. Weiss. Uniform Chernoff and Dvoretzky-Kiefer-Wolfowitz-type inequalities for Markov chains and related processes. Journal of Applied Probability, 51(4):1100-1113, 2014.
[4] D. W. Loveland. A new interpretation of the von Mises concept of random sequence. Zeitschrift für mathematische Logik und Grundlagen der Mathematik, 12:279-294, 1966.
[5] J. Ratsaby. On a predictor's complexity and the frequency-instability of its mistakes. http://www.ariel.ac.il/sites/ratsaby/Publications/PDF/catalystlearner60.pdf, September 2017.