Average Reward Optimization Objective In Partially Observable Domains - Supplementary Material

Another example of a controlled system (see Sec. 4.2)

Figure 3. The first two plots describe the behavior of the system, which rotates the state by 10 degrees under action a and by -10 degrees under action b, with the reset states aligned in such a way that taking the opposite action brings the system to its topmost point. The x and y in the plots represent possible actions such that x ≠ y. The bottom plot demonstrates how the average reward changes as a function of α and β, where the policies having 2 hidden states (S ∈ {1, 2}) are parametrized as:

$P_\pi(a \mid S=1) = P_\pi(b \mid S=2) = 1$,
$P_\pi(s'=1 \mid S=1, a, o_1) = P_\pi(s'=2 \mid S=2, a, o_1) = \alpha$,
$P_\pi(s'=2 \mid S=1, a, o_2) = P_\pi(s'=1 \mid S=2, a, o_2) = \alpha$,
$P_\pi(s'=1 \mid S=1, b, o_1) = P_\pi(s'=2 \mid S=2, b, o_1) = \beta$,
$P_\pi(s'=2 \mid S=1, b, o_2) = P_\pi(s'=1 \mid S=2, b, o_2) = \beta$.
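As an illustration, the two-memory-state policy in the caption can be encoded as a stochastic finite state controller (initial distribution, per-action diagonal matrices, per-observation transition matrices, the representation used later in Proposition 2). The sketch below is illustrative, not code from the paper; in particular, the uniform initial memory state and the function name are assumptions.

```python
import numpy as np

def two_state_policy(alpha, beta):
    """Finite-state controller with memory states {1, 2}: action a is taken
    deterministically in memory state 1 and action b in memory state 2;
    after each observation the memory state is kept or switched with the
    probabilities alpha (rows where a was taken) and beta (rows where b was
    taken), as in the Figure 3 caption."""
    c0 = np.array([0.5, 0.5])             # assumed uniform initial memory state
    A = {"a": np.diag([1.0, 0.0]),        # P(a | S=1) = 1
         "b": np.diag([0.0, 1.0])}        # P(b | S=2) = 1
    # memory transitions: under o1 each state is kept w.p. alpha (after a)
    # or beta (after b); under o2 it is switched with the same probabilities
    T = {"o1": np.array([[alpha, 1 - alpha], [1 - beta, beta]]),
         "o2": np.array([[1 - alpha, alpha], [beta, 1 - beta]])}
    return c0, A, T

c0, A, T = two_state_policy(alpha=0.7, beta=0.2)
assert np.isclose(c0.sum(), 1.0)
assert np.allclose(A["a"] + A["b"], np.eye(2))    # action choice is exhaustive
for To in T.values():
    assert np.allclose(To.sum(axis=1), 1.0)       # rows are distributions
```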
Proof of Lemma 1

1.: $E E^\infty = E^\infty E = E^\infty$ is immediate from the definition of $E^\infty$. Therefore $\left[\frac{1}{n}\sum_{t=1}^{n} E^t\right] E^\infty = E^\infty$ for any $n$, implying that also $E^\infty E^\infty = E^\infty$.

2.:

$(E - E^\infty)^n = (E - E E^\infty)^n = E^n (I - E^\infty)^n = E^n \left( \sum_{i=0}^{n} (-1)^i \binom{n}{i} (E^\infty)^i \right) = E^n \left( I + \left( \sum_{i=1}^{n} (-1)^i \binom{n}{i} \right) E^\infty \right) = E^n - E^\infty,$

where some of the equalities follow from property 1, and the last equality holds since the alternating sum of binomial coefficients equals 0.

3.: Since $(E - I)$ has rank $n - 1$, the eigenspace of $E$ corresponding to the eigenvalue 1 is one dimensional. So we have a unique $\rho$ such that $\rho^\top E = \rho^\top$ and $\rho^\top \mathbf{1} = 1$. Also, if $Ev = v$ then $v$ is a constant vector because the rows of $E$ sum to 1. Hence also $\rho^\top E^\infty = \rho^\top$ and $E^\infty \mathbf{1} = \mathbf{1}$. Now, from property 1 we have $E^\infty (E - I) = 0$, meaning that the rows of $E^\infty$ are multiples of $\rho^\top$. Also, we have $(E - I) E^\infty = 0$, meaning that the columns of $E^\infty$ are constant vectors. Combining all together we get $E^\infty = \mathbf{1}\rho^\top$.

4.: Observe that $(E - E^\infty)$ is Cesaro summable to 0, i.e.

$\frac{1}{n}\sum_{t=1}^{n} (E - E^\infty)^t = \left[\frac{1}{n}\sum_{t=1}^{n} E^t\right] - E^\infty \to 0,$

due to property 2. So, by Kemeny & Snell (1960) (Thm. 1.11.1) and property 2,

$\left[I - (E - E^\infty)\right]^{-1} = \lim_{n\to\infty} \frac{1}{n}\sum_{k=0}^{n} \sum_{t=0}^{k} (E - E^\infty)^t = I + \lim_{n\to\infty} \frac{1}{n}\sum_{k=1}^{n} \sum_{t=1}^{k} (E^t - E^\infty).$

5.: Follows from replacing $Z$ with the right-hand side of Eq. (2).

Proof of Theorem 2

The proof is essentially identical to the proof of Theorem 1 in Schweitzer (1968), which is, however, restricted to Markov chain transition matrices. We have

$I - (E_2 - E_1)Z_1 = (Z_1^{-1} - E_2 + E_1)Z_1 = (I - E_2 + E_1^\infty)Z_1 = (Z_2^{-1} + E_1^\infty - E_2^\infty)Z_1 = Z_2^{-1}(I + E_1^\infty - E_2^\infty)Z_1,$

where all (except for the last) equalities follow from the definition of $Z_i$. The last equality holds since $Z_2^{-1} E_1^\infty = E_1^\infty$ and $Z_2^{-1} E_2^\infty = E_2^\infty$, which in turn is due to $Z_2^{-1}$ having rows that sum to 1, and property 3 in Lemma 1. Now, observe that $(E_1^\infty - E_2^\infty)^2 = 0$, meaning that Theorem 1.11.1 in Kemeny & Snell (1960) can be applied to $(E_1^\infty - E_2^\infty)$, and we get

$\left[I + (E_1^\infty - E_2^\infty)\right]^{-1} = I - E_1^\infty + E_2^\infty,$

which proves Eq. (3). Finally,

$\rho_1^\top H_{1\to 2} = \rho_1^\top Z_1^{-1}\left[I - E_1^\infty + E_2^\infty\right] Z_2 \overset{(*)}{=} \rho_1^\top E_2^\infty Z_2 = \rho_2^\top Z_2 = \rho_2^\top,$

where $(*)$ is due to property 3 in Lemma 1 and $\rho_1^\top \mathbf{1} = 1$, while the last equality is due to property 5 in Lemma 1.
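Lemma 1 and Theorem 2 are easy to sanity-check numerically. The sketch below (illustrative code, not from the paper) draws two random ergodic transition matrices, verifies property 2 of Lemma 1 and the rank-one form $E^\infty = \mathbf{1}\rho^\top$, and checks Schweitzer's perturbation identity $\rho_1^\top H_{1\to 2} = \rho_2^\top$ with $H_{1\to 2}^{-1} = I - (E_2 - E_1)Z_1$:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(n):
    # strictly positive entries -> irreducible, aperiodic chain
    M = rng.random((n, n)) + 0.1
    return M / M.sum(axis=1, keepdims=True)

def stationary(E):
    # left eigenvector for eigenvalue 1, normalized to sum to 1
    w, V = np.linalg.eig(E.T)
    v = np.real(V[:, np.argmax(np.real(w))])
    return v / v.sum()

n = 5
E1, E2 = random_stochastic(n), random_stochastic(n)
rho1, rho2 = stationary(E1), stationary(E2)
# property 3: E^inf = 1 rho^T (outer product of ones with the stationary dist.)
E1inf, E2inf = np.outer(np.ones(n), rho1), np.outer(np.ones(n), rho2)

# property 2: (E - E^inf)^t = E^t - E^inf
t = 4
assert np.allclose(np.linalg.matrix_power(E1 - E1inf, t),
                   np.linalg.matrix_power(E1, t) - E1inf)

# fundamental matrix Z = (I - E + E^inf)^{-1} and Theorem 2's identity
Z1 = np.linalg.inv(np.eye(n) - E1 + E1inf)
H = np.linalg.inv(np.eye(n) - (E2 - E1) @ Z1)
assert np.allclose(rho1 @ H, rho2)
```

The last assertion is exactly the statement that the stationary distribution of the perturbed chain is obtained from the original one by a single linear map.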
Proof of Theorem 3

Recall the concept of a system dynamics matrix (SDM) (Singh et al., 2004), also known as a prediction matrix (Faigle & Schönhuth, 2007). This is an infinite dimensional matrix whose columns correspond to possible histories (ordered in increasing length) and whose rows correspond to possible tests (ordered in increasing length). The elements of this matrix represent conditional probabilities of observing each test given each history (for more details see Singh et al. (2004); Faigle & Schönhuth (2007)). If the SDM has rank $k$ then the system can be represented by a $k$-dimensional linear PSR, and the square submatrix of the SDM of dimension $|A \times O|^k$ will be of rank $k$ (e.g., see James (2005)). According to Wiewiora (2007), the system resulting from combining a $k$-dimensional linear PSR with control and a policy with memory of size $l$ can be represented by a linear PSR without control whose dimension is at most $k \cdot l$. We will denote the distribution over future sequences in such a system by $P_\theta$. Let $n$ be the largest rank of an SDM induced by some policy, implying that a minimal dimensional PSR for this system has rank $n$ (Wiewiora, 2007). Let $g_\theta^{(i)}$ represent the expected state of the system after $i$ time steps, i.e., $g_\theta^{(i)}$ is an infinite-dimensional vector such that $g_\theta^{(i)}(t) = \sum_{h \in (A\times O)^i} P_\theta(t \mid h) P_\theta(h)$. Since the $g_\theta^{(i)}$ are linear combinations of columns of the SDM, we have that $G_\theta \triangleq \mathrm{span}\{g_\theta^{(0)}, g_\theta^{(1)}, \ldots\}$ is at most $n$ dimensional. Let $m$ be the maximum dimension of $G_\theta$ for any $\theta \in \Theta$. We will use the following results from Faigle & Schönhuth (2007), which consider a stochastic process represented by an OOM for a fixed $\theta \in \Theta$:

1. If $G_\theta$ is $m$-dimensional, $\{g_\theta^{(0)}, \ldots, g_\theta^{(m-1)}\}$ is a basis for $G_\theta$.

2. The state evolution operator can be represented by the matrix $E_\theta \in \mathbb{R}^{m \times m}$ relative to this basis, such that

$E_\theta = \begin{pmatrix} 0 & 0 & \cdots & 0 & c^{(0)} \\ 1 & 0 & \cdots & 0 & c^{(1)} \\ 0 & 1 & \cdots & 0 & c^{(2)} \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & c^{(m-1)} \end{pmatrix}$

where $\sum_{j=0}^{m-1} c^{(j)} = 1$, and all the $c^{(j)}$'s are unique.
$E_\theta$ is the PSR evolution matrix represented under a different basis, since $\forall j \in \mathbb{N}: g_\theta^{(j)} = \sum_{i=0}^{m-1} a_i^{(j)} g_\theta^{(i)}$, where $a^{(j)} \triangleq (a_0^{(j)}, \ldots, a_{m-1}^{(j)}) = (1, 0, \ldots, 0)\, E_\theta^j$. In other words, it outputs the next expected state provided the current state.

3. Furthermore, $E_\theta^\infty \triangleq \lim_{k\to\infty} \frac{1}{k} \sum_{t=1}^{k} E_\theta^t$ exists.

The remainder of the proof concerns the construction of the $c^{(i)}$'s as functions of $\theta$ and showing that these are well defined rational functions of $\theta$ for all $\theta \in \Theta$. Once this is done, Theorem 2 can be applied to two arbitrary matrices $E_{\theta_1}$, $\theta_1 \in \Theta$, and $E_{\theta_2}$, $\theta_2 \in \Theta$, concluding the proof. Due to the properties of the SDM mentioned above, we can identify the infinite-dimensional vectors $g_\theta^{(i)}$ with their finite dimensional counterparts $\bar g_\theta^{(i)}$, defined over the tests $t \in \{(A\times O)^i\}_{i=1,\ldots,n}$ by $\bar g_\theta^{(i)}(t) = g_\theta^{(i)}(t)$, such that $\dim \mathrm{span}\{\bar g_\theta^{(0)}, \bar g_\theta^{(1)}, \ldots\} = \dim \mathrm{span}\{g_\theta^{(0)}, g_\theta^{(1)}, \ldots\}$, for all $\theta \in \Theta$. Note that the same matrix $E_\theta$ represents the unique¹ evolution matrix relative to $\{\bar g_\theta^{(0)}, \ldots, \bar g_\theta^{(m-1)}\}$. Let $G_\theta$ be the matrix with these vectors as columns,

$G_\theta = \left[\bar g_\theta^{(0)}, \bar g_\theta^{(1)}, \ldots, \bar g_\theta^{(m-1)}\right].$

Note that $\theta$ is a direct parametrization of a policy, meaning that the elements of $G_\theta$ are polynomials in $\theta$. Thus, Proposition 1 is applicable to $G_\theta$, and from there we obtain a well defined orthogonal basis $\{b_\theta^{(0)}, \ldots, b_\theta^{(m-1)}\}$ for

¹ As long as $\{\bar g_\theta^{(0)}, \ldots, \bar g_\theta^{(m-1)}\}$ is a linearly independent set of vectors.
the column space of $G_\theta$, such that these basis vectors are rational functions of $\theta$. Now we can define the $c_\theta^{(i)}$ in the following recursive way:

$i = m-1:\quad c_\theta^{(i)} = \frac{\left\langle g_\theta^{(m)},\; b_\theta^{(m-1)} \right\rangle}{\left\| b_\theta^{(m-1)} \right\|_2^2}$

$i < m-1:\quad c_\theta^{(i)} = \frac{\left\langle g_\theta^{(m)} - \sum_{j=i+1}^{m-1} c_\theta^{(j)} g_\theta^{(j)},\; b_\theta^{(i)} \right\rangle}{\left\| b_\theta^{(i)} \right\|_2^2}.$

It is easy to verify that indeed we get $g_\theta^{(m)} = \sum_{i=0}^{m-1} c_\theta^{(i)} g_\theta^{(i)}$ for the $\theta$'s for which the $c_\theta^{(i)}$'s are well defined. We also know that these coefficients are unique as long as $G_\theta$ has full rank. Since these coefficients are rational functions of $\theta$, by the same argument as in Proposition 1 we get that around any potential singularity point of $c_\theta^{(i)}$ we can find a sequence of $\theta$'s converging to the singularity point while $c_\theta^{(i)}$ is well defined on this sequence. However, we know that for all $\theta$'s where the $c_\theta^{(i)}$'s are well defined, they must be bounded. This is due to the fact that $E_\theta^\infty$ exists for all such $\theta$'s, implying that the spectral radius of $E_\theta$ is bounded (and equals 1). Hence, $\forall\, 0 \le i < m$: $c_\theta^{(i)}$ does not have singularity points, or in other words, is well defined for all $\theta \in \Theta$. Finally, we note that every $E_\theta$ satisfies $\mathrm{rank}(E_\theta - I) = m - 1$ by construction, so the conditions of Theorem 2 are satisfied for any pair of these matrices. The left eigenvector corresponding to eigenvalue 1 of $E_\theta$ identifies the stationary mean of the stochastic process $S$ induced by policy $\theta$ due to the following: this vector represents the invariant state of the system with respect to the column vectors in $G_\theta$, which are by themselves linear combinations of any fixed column basis of the SDM matrix. Recall that due to Theorem 2, the entries of this eigenvector are rational functions of the policy parameters. Therefore, we get that each entry of the stationary distribution of the PSR is also a rational function of the policy parameters. To complete the proof, note that the stationary distribution of the PSR that we have obtained coincides with the empirical distribution of sequences observed from data since the stationary distribution is unique, which in turn is due to the ergodicity assumption.

Proposition 1.
Let $G_\theta \in \mathbb{R}^{n \times m}$, $m \le n$, be a rational matrix valued function of $\theta \in \Theta$, such that $G_\theta$ is bounded in $\Theta$, and let $\Theta$ be a subspace of some finite dimensional Euclidean space. Assume that $G_\theta$ has full rank for some $\theta \in \Theta$. Let $\{g_\theta^{(i)}\}_{0 \le i < m}$ be the columns of $G_\theta$. Then, $\{b_\theta^{(0)}, \ldots, b_\theta^{(m-1)}\}$ defined by

$i = 0:\quad b_\theta^{(i)} = g_\theta^{(i)}$

$i > 0:\quad b_\theta^{(i)} = g_\theta^{(i)} - \sum_{j=0}^{i-1} \frac{\left\langle g_\theta^{(i)},\; b_\theta^{(j)} \right\rangle}{\left\| b_\theta^{(j)} \right\|_2^2}\, b_\theta^{(j)}$

is a well defined orthogonal basis for the column space of $G_\theta$ for all $\theta \in \Theta$.

Proof. Since $G_\theta$ is rational in $\theta$, the $\{b_\theta^{(i)}\}_{0 \le i < m}$ are rational functions (element wise) in $\theta$. For any $\theta$, in particular one where the function is not defined/singular², we can find a sequence $\theta_1, \theta_2, \ldots \to \theta$ on which this function is well defined. Suppose that such a singularity occurs in some element of $b_{\theta_0}^{(i)}$ for some $\theta_0 \in \Theta$, and without loss of generality assume that the $b_{\theta_0}^{(j)}$ are well defined for $0 \le j < i$. Since $\forall j$: the $g_\theta^{(j)}$'s are bounded, at least one of the elements of the following terms ($0 \le j < i$) must approach singularity as $\theta \to \theta_0$:

$\frac{\left\langle g_\theta^{(i)},\; b_\theta^{(j)} \right\rangle}{\left\| b_\theta^{(j)} \right\|_2^2}\, b_\theta^{(j)},$

meaning that $\left\| g_\theta^{(i)} \right\|_2^2$ must have a singularity point at $\theta_0$, since

$\left\| g_\theta^{(i)} \right\|_2^2 \ge \left( \frac{\left\langle g_\theta^{(i)},\; b_\theta^{(j)} \right\rangle}{\left\| b_\theta^{(j)} \right\|_2^2} \right)^2 \left\| b_\theta^{(j)} \right\|_2^2 \to \pm\infty.$

But $\left\| g_\theta^{(i)} \right\|_2^2$ is bounded around $\theta_0$, leading to a contradiction.

² The function $f(x)$ is singular at the points $\{x \in X : \lim_{\hat x \to x} f(\hat x) = \pm\infty\}$. For rational functions, these are the roots of the polynomial in the denominator which do not coincide with the roots of the polynomial in the numerator.
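The recursion in Proposition 1 is the classical Gram-Schmidt orthogonalization (without the final normalization). As a quick numerical sketch (illustrative names, not code from the paper), applying it to a random full-rank matrix yields pairwise-orthogonal columns spanning the same column space:

```python
import numpy as np

def gram_schmidt_columns(G):
    """Orthogonalize the columns of G by the recursion of Proposition 1:
    b_0 = g_0, and b_i = g_i - sum_{j<i} (<g_i, b_j> / ||b_j||^2) b_j."""
    basis = []
    for i in range(G.shape[1]):
        b = G[:, i].copy()
        for prev in basis:
            b = b - (G[:, i] @ prev) / (prev @ prev) * prev
        basis.append(b)
    return np.column_stack(basis)

rng = np.random.default_rng(1)
G = rng.random((6, 3))            # full column rank with probability 1
B = gram_schmidt_columns(G)

# columns of B are pairwise orthogonal ...
gram = B.T @ B
assert np.allclose(gram - np.diag(np.diag(gram)), 0)
# ... and span the same space as the columns of G
assert np.linalg.matrix_rank(np.hstack([G, B])) == np.linalg.matrix_rank(G)
```

Note that, exactly as in the proposition, each projection coefficient uses the original column $g^{(i)}$ rather than a running residual; orthogonality of the earlier $b^{(j)}$'s makes the two variants equivalent.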
Proof of Theorem 4

The proof of this theorem uses two results that are given in Propositions 2 and 3, with their proofs attached. Proposition 2 provides a clear way of defining the parameters of a PSR without control given a PSR (with control) and the parameters of a finite memory policy. Proposition 3 shows that one can obtain the state of the PSR (with control) from the state of the PSR without control by applying a linear operator.

First, we start by showing that $\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} R_t$ exists almost surely for any policy $\theta \in \Theta$. Recall that we do not assume that the process is AMS (otherwise there is nothing to prove), only that the process is ergodic. Let $\bar R_k \triangleq \frac{1}{k} \sum_{t=1}^{k} R_t$,

$f(\omega) \triangleq \begin{cases} \lim_{k\to\infty} \bar R_k & \text{if the limit exists} \\ R_{\min} - 1 & \text{otherwise,} \end{cases} \qquad f_k(\omega) \triangleq \begin{cases} \bar R_k & \text{if } f(\omega) \ge R_{\min} \\ R_{\min} - 1 & \text{otherwise,} \end{cases}$

where $R_{\min}$ is the lowest achievable reward (recall that the reward is bounded). From the definition we have that $\{f_k\}_k \to f$ pointwise everywhere. Since the $\{f_k\}$ are uniformly bounded,

$\lim_{k\to\infty} E_\theta(f_k) = E_\theta(f).$

Finally, note that the set $\{f(\omega) = y\}$ is invariant for any $y$ because $f$ is invariant: $f(Tx) = f(x)$. By the definition of ergodicity, such a set has either probability 1 or 0. Further we show that $\lim_{k\to\infty} E_\theta(\bar R_k)$ is well defined, which ensures that $E_\theta(f) = E_\theta(\lim_{k\to\infty} \bar R_k) \ge R_{\min}$³, hence $f = E_\theta(f)$ almost surely.

Let $\{w_\theta, \{W_{\theta,ao}\}, p_{\theta,0}\}$ be the minimal PSR without control for some policy $\theta \in \Theta$, and let $W_{\theta,\bullet} \triangleq \sum_{ao} W_{\theta,ao}$ be the PSR evolution matrix. Due to the properties of this matrix discussed earlier we have that

$W_\theta^\infty \triangleq \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} W_{\theta,\bullet}^t$ exists.

³ Let $B$ indicate the set on which $\lim_{k\to\infty} \bar R_k$ exists. Then $\lim_k E_\theta(\bar R_k) = \lim_k \left[ \int_B \bar R_k \, dP + \int_{\bar B} \bar R_k \, dP \right] = \lim_k \int_B \bar R_k \, dP + \lim_k \int_{\bar B} \bar R_k \, dP = \lim_k E_\theta(\bar R_k \mid B) P(B) + \lim_k \int_{\bar B} \bar R_k \, dP$. Since both the left hand side limit and the first right hand side limit are well defined, the second right hand side limit must exist as well, implying that $P(\bar B) = 0$.
Therefore,

$\lim_{k\to\infty} E_\theta\!\left( \frac{1}{k} \sum_{t=1}^{k} R_t \right) = \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} E_\theta(R_t)$  (1)

$= \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \sum_{h^{(t)}} E(R_t \mid h^{(t)})\, P_\theta(h^{(t)})$  (2)

$= \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \sum_{h^{(t-1)}, a, o} r_a^\top\, p(h^{(t-1)}, a, o)\, P_\theta(h^{(t-1)}, a, o)$  (3)

$= \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \sum_a r_a^\top V\!\left( \sum_{h^{(t-1)}, o} p_\theta(h^{(t-1)}, a, o)\, P_\theta(h^{(t-1)}, a, o) \right)$  (4)

$= \sum_a r_a^\top V\!\left( \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} p_{\theta,0}^\top W_{\theta,\bullet}^{t} W_{\theta,a} \right)$  (5)

$= \sum_a r_a^\top V\!\left( p_{\theta,0}^\top W_\theta^\infty W_{\theta,a} \right) = \sum_a r_a^\top V\!\left( \rho_\theta^\top W_{\theta,a} \right)$  (6)

$= \sum_a r_a^\top V\!\left( \frac{\rho_\theta^\top W_{\theta,a}}{\rho_\theta^\top W_{\theta,a}\, w_\theta} \right) \left( \rho_\theta^\top W_{\theta,a}\, w_\theta \right) \triangleq \sum_a r_a^\top V\!\left( \rho_\theta(a) \right) P_\theta(a)$  (7)

$\triangleq \sum_a r_a^\top \rho_\theta^{PRP}(a)\, \rho_\theta^{Pol}(a),$  (8)

where (3) follows from the definition of the linear PRP; (4) and (8) are due to Proposition 3, with $V$ being the linear operator transforming the state of the PSR without control into the state of the PSR (with control); (5) is by the definition of a linear PSR; (6) is due to the properties of the PSR evolution matrix $W_{\theta,\bullet}$, where $\rho_\theta$ is the stationary distribution of the PSR without control induced by policy $\theta$; and $\rho_\theta(a)$ in (7) and (8) represents the state of the PSR without control obtained by starting from $\rho_\theta$ and taking action $a$, while $P_\theta(a)$ is the average probability of taking action $a$.

Finally, note that $\rho_\theta$ and $\rho_\theta(a)$ are rational functions of $\theta \in \Theta$ by Theorem 3. Hence both $\rho_\theta^{Pol}$ and $\rho_\theta^{PRP}(a)$ are rational functions of $\theta$ as well, since they can be immediately obtained from $\rho_\theta$ and $\rho_\theta(a)$ through summation and division (i.e., observe that the PRP state $\rho_\theta^{PRP}$ is a vector of conditional predictions while the PSR without control state $\rho_\theta$ is a vector of joint predictions).

Proposition 2. Let $\{m, \{M_{ao}\}_{ao \in A\times O}, p_0, Q = \{q_1, \ldots, q_n\}\}$ be an $n$-dimensional linear PSR (with control) and $\{c_0, \{T_o\}_{o\in O}, \{A_a\}_{a\in A}\}$ be the representation of a stochastic finite state policy $\pi$ of size $l$. Let

$s_0 \triangleq p_0 \otimes c_0 \in \mathbb{R}^{nl}, \qquad b \triangleq m \otimes \mathbf{1} \in \mathbb{R}^{nl}, \qquad \forall ao \in A\times O:\ B_{ao} \triangleq M_{ao} \otimes A_a T_o \in \mathbb{R}^{nl \times nl},$

where $\otimes$ represents the Kronecker product. Then, the triple $\{b, \{B_{ao}\}_{ao\in A\times O}, s_0\}$ satisfies the properties of a PSR without control induced by the policy $\pi$. In particular,

$P_\pi(a_1 o_1, \ldots, a_k o_k) = s_0^\top B_{a_1 o_1} \cdots B_{a_k o_k}\, b.$
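Before the proof, the construction can be sanity-checked numerically. In the sketch below (illustrative code, not from the paper), a small controlled HMM stands in for the PSR with control, an assumption made for the example: the normalization vector is $m = \mathbf{1}$ and $M_{ao} = \mathrm{diag}(P(o \mid s, a))\, T_a$, which guarantees $\sum_o M_{ao} m = m$. The code builds $s_0$, $b$, and the $B_{ao}$ and verifies the normalization and fixed-point properties, plus that the joint probabilities of all length-2 action-observation sequences sum to 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_pol, A, O = 3, 2, 2, 2     # PSR dim, policy states, |actions|, |observations|

def row_stochastic(shape):
    M = rng.random(shape) + 0.1
    return M / M.sum(axis=-1, keepdims=True)

# a controlled HMM as a concrete PSR (with control): m = 1 (ones vector),
# p0 a belief state, and M_ao = diag(P(o | s, a)) @ T_a
T = [row_stochastic((n, n)) for _ in range(A)]        # transitions per action
Obs = [row_stochastic((n, O)) for _ in range(A)]      # P(o | s, a)
M = {(a, o): np.diag(Obs[a][:, o]) @ T[a] for a in range(A) for o in range(O)}
p0, m = row_stochastic((n,)), np.ones(n)

# a stochastic finite-state policy: c0, diagonal A_a, transition T_o
c0 = row_stochastic((n_pol,))
act = row_stochastic((n_pol, A))                      # P(a | policy state)
Apol = [np.diag(act[:, a]) for a in range(A)]
Tpol = [row_stochastic((n_pol, n_pol)) for _ in range(O)]

# Proposition 2's construction of the PSR without control
s0 = np.kron(p0, c0)
b = np.kron(m, np.ones(n_pol))
B = {(a, o): np.kron(M[(a, o)], Apol[a] @ Tpol[o]) for a in range(A) for o in range(O)}

assert np.isclose(s0 @ b, 1.0)                # first property: s0^T b = 1
assert np.allclose(sum(B.values()) @ b, b)    # second property: sum_ao B_ao b = b
# joint probabilities of all length-2 action-observation sequences sum to 1
total = sum(s0 @ B[k1] @ B[k2] @ b for k1 in B for k2 in B)
assert np.isclose(total, 1.0)
```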
Proof. For clarity, here is a description of the meaning of the stochastic finite state policy parameters:

$c_0$ - the initial distribution over the states of the policy
$A_a$ - the diagonal matrix with $[A_a]_{ii}$ equal to the probability of taking action $a$ in state $i$
$T_o$ - the transition matrix defining the state dynamics of the policy corresponding to observing percept $o$

Let $C_{ao} \triangleq A_a T_o$. Recall that for any history $h$,

$P_\pi(o_{1:k} \mid h, a_{1:k}) = p(h)^\top M_{a_1 o_1} \cdots M_{a_k o_k}\, m \triangleq p(h)^\top M_{ao_{1:k}}\, m,$

$P_\pi(a_{1:k} \mid h, o_{1:k-1}) = c(h)^\top A_{a_1} T_{o_1} \cdots A_{a_{k-1}} T_{o_{k-1}} A_{a_k} \mathbf{1} \triangleq c(h)^\top C_{ao_{1:k-1}} A_{a_k} \mathbf{1},$

where $\mathbf{1}$ is a column vector of ones of the appropriate size. Observe that

$P_\pi(ao) = P_\pi(a)\, P_\pi(o \mid a) = c_0^\top C_{ao} \mathbf{1} \cdot p_0^\top M_{ao}\, m.$

By induction (similarly to Wiewiora (2005)) we get

$P_\pi(ao_{1:k}, a_{k+1}, o_{k+1}) = P_\pi(ao_{1:k})\, P_\pi(a_{k+1} \mid ao_{1:k})\, P_\pi(o_{k+1} \mid ao_{1:k}, a_{k+1})$

$= c_0^\top C_{ao_{1:k}} \mathbf{1} \cdot p_0^\top M_{ao_{1:k}}\, m \cdot c(ao_{1:k})^\top A_{a_{k+1}} \mathbf{1} \cdot p(ao_{1:k})^\top M_{a_{k+1} o_{k+1}}\, m$

$= c_0^\top C_{ao_{1:k}} \mathbf{1} \cdot p_0^\top M_{ao_{1:k}}\, m \cdot \frac{c_0^\top C_{ao_{1:k}} A_{a_{k+1}} \mathbf{1}}{c_0^\top C_{ao_{1:k}} \mathbf{1}} \cdot \frac{p_0^\top M_{ao_{1:k}} M_{a_{k+1} o_{k+1}}\, m}{p_0^\top M_{ao_{1:k}}\, m}$

$= c_0^\top C_{ao_{1:k}} A_{a_{k+1}} T_{o_{k+1}} \mathbf{1} \cdot p_0^\top M_{ao_{1:k+1}}\, m = p_0^\top M_{ao_{1:k+1}}\, m \cdot c_0^\top C_{ao_{1:k+1}} \mathbf{1},$

where we used the fact that $C_{ao_{1:k}} \mathbf{1} = C_{ao_{1:k-1}} A_{a_k} \mathbf{1}$, since every $T_o$ is a transition matrix. Now, the first property holds since

$s_0^\top b = [p_0 \otimes c_0]^\top [m \otimes \mathbf{1}] = p_0^\top m \cdot c_0^\top \mathbf{1} = 1.$

The second property also holds because

$\sum_{ao} B_{ao} b = \sum_{ao} [M_{ao} \otimes C_{ao}][m \otimes \mathbf{1}] = \sum_{ao} [M_{ao} m \otimes C_{ao} \mathbf{1}] = \sum_{ao} [M_{ao} m \otimes A_a \mathbf{1}] = \sum_a \left[ \left( \sum_o M_{ao} m \right) \otimes A_a \mathbf{1} \right] = \sum_a [m \otimes A_a \mathbf{1}] = m \otimes \sum_a A_a \mathbf{1} = m \otimes \mathbf{1} = b.$

Finally, the last property is due to

$P_\pi(ao_{1:k}) = p_0^\top M_{ao_{1:k}}\, m \cdot c_0^\top C_{ao_{1:k}} \mathbf{1} = [p_0 \otimes c_0]^\top [M_{ao_{1:k}} \otimes C_{ao_{1:k}}][m \otimes \mathbf{1}] = s_0^\top B_{ao_{1:k}}\, b.$

Proposition 3. Let the action-observation stochastic process be generated from a linear $n$-dimensional PSR (with control), $p$, and a stochastic finite state controller of size $m$ parametrized by $\theta \in \Theta$. Let $p_\theta$ be a minimal linear PSR without control representing this process. Then, the state of the PSR (with control), $p$, can be obtained from the state of the PSR without control, $p_\theta$, by applying a linear operator.
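Before the proof, the core algebraic step behind the claimed linear recovery, namely that $U = I \otimes \mathbf{1}^\top$ maps the Kronecker-product state $p(h) \otimes c(h)$ back to $p(h)$ whenever $c(h)$ sums to 1, can be checked directly. The snippet is an illustrative sketch, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 3                          # PSR dimension and controller size

p = rng.random(n)                    # an arbitrary PSR state p(h)
c = rng.random(m)
c = c / c.sum()                      # controller memory state c(h), a distribution

U = np.kron(np.eye(n), np.ones(m))   # U = I (x) 1^T, shape n x nm
s = np.kron(p, c)                    # the construct state s(h) = p(h) (x) c(h)

assert np.allclose(U @ s, p)         # U s(h) = p(h), since 1^T c(h) = 1
```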
Proof. We will refer to $p$ as both the PSR (with control) and the state of this PSR (which one is meant should be clear from the context). Equivalently, we will refer to $p_\theta$ as both the PSR without control and the state of this PSR. First, let $\{b, B_{ao}, s_0\}$ be defined as in Proposition 2 from $p$ and the parameters of the policy represented as $\{c_0, \{T_o\}_{o\in O}, \{A_a\}_{a\in A}\}$. Let $U = I \otimes \mathbf{1}^\top \in \mathbb{R}^{n \times nm}$, where $I$ is the $n$-dimensional identity matrix. Let $s(h) \triangleq p(h) \otimes c(h)$ be the state of this PSR-like construct after observing history $h$. Then,

$U s(h) = [I \otimes \mathbf{1}^\top]\, s(h) = [I \otimes \mathbf{1}^\top][p(h) \otimes c(h)] = p(h) \otimes [\mathbf{1}^\top c(h)] = p(h).$

Therefore we can recover the state of the environment $p$ from the state of our construct by applying a linear operator. What is missing yet is the connection between this construct and the PSR without control $p_\theta$. Recall that $p$ is a vector of predictions of some sequences of observations given some sequences of actions. So it is clear that we can compute $p$ from the joint distribution over action-observation pairs provided by the state of $p_\theta$, which in general is a rational function of $p_\theta$. Let $V(\cdot): p_\theta \to p$ be such a mapping. Let the dimension of $p_\theta$ be $k$, and let $W$ be the $nm \times k$ matrix whose columns are the coefficients of $k$ core tests with respect to the construct $\{b, B_{ao}, s_0\}$: $s(h)^\top W = p_\theta(h)^\top$ for any $h$. Now, we have that for any $h$:

$U s(h) = p(h) = V(s(h)^\top W) = V(p_\theta(h)).$

From here the conclusion follows. We can represent the operator $V$ with a matrix $V \in \mathbb{R}^{n \times k}$. $V$ can be obtained by solving the following system of equations: $V = P P_\theta^{-1}$, where $P_\theta \in \mathbb{R}^{k \times k}$ represents a collection (matrix) of $k$ linearly independent states (those corresponding to core histories) and $P \in \mathbb{R}^{n \times k}$ represents the collection of environment states corresponding to the same histories.

Proof of Corollary 5

In both representations the average reward function is linear in the stationary distribution as a function of the policy parameters, hence the complexity of the stationary distribution gives the upper bound on the complexity of the average reward. Let $k$ be the size of the stochastic finite state controllers we consider.
We analyze the complexity with respect to the POMDP representation first. Observe that the HMM representing the combined system and policy will have $km$ hidden states. Since the Markov chain defined over the hidden states induced by any of the policies of interest is irreducible due to the ergodicity assumption, the corresponding transition matrix satisfies the conditions of Theorem 1 of Schweitzer (1968), which is a special case of Theorem 2. Each entry of this transition matrix is a polynomial of a fixed degree in the parameters of the policy. So are the entries of $H_{1\to 2}^{-1}$ in Theorem 2 when we fix $E_1, Z_1, \rho_1$ and let $E_2$ be our parametrized transition matrix. One can analytically invert $H_{1\to 2}^{-1}$ using Cramer's rule and observe that each entry of $H_{1\to 2}$ is a rational function whose degree is $O(km)$. Therefore, by Theorem 2 the degree of the stationary belief state $\rho_2$ is also $O(km)$.

Now we consider the reward process being represented by a linear PRP. Observe that the linear PSR without control representing the action-observation stochastic process induced by the policy above is at most $kn$ dimensional. Each entry of the SDM describing this action-observation stochastic process is a rational function with leading degree equal to the length of the history plus test. Therefore, the column vectors stored in $G_\theta$ in the proof of Theorem 3 are of degree at most $O(kn)$. The elements of the basis constructed from the columns of $G_\theta$ are also of degree at most $O(kn)$ (see Proposition 1), as are the entries of the bottom row of the evolution matrix $E_\theta$ constructed in Theorem 3. Similarly to the POMDP case, we need to apply Theorem 2 for some fixed $E_1, Z_1, \rho_1$, where $E_2 \triangleq E_\theta$ is the function of the policy parameters. Observe that from Theorem 2 we also have $H_{1\to 2} = Z_1^{-1}(Z_1^{-1} + E_1 - E_2)^{-1}$, where the resulting matrix $Z_1^{-1} + E_1 - E_2$ has fixed entries in all rows except for the bottom one, whose entries are rational functions of degree at most $O(kn)$.
Again, using Cramer's rule one can invert the matrix analytically, which results in $H_{1\to 2}$ having entries that are rational functions of degree at most $O(kn)$. Then, by Theorem 2, the degree of the stationary distribution of the PSR without control, $\rho_\theta \triangleq \rho_2$, is also at most $O(kn)$. Finally, one can verify that the quantities $\rho_\theta^{PRP}(a)$ and $\rho_\theta^{Pol}$ appearing in Theorem 4 will also have leading degree of the
same order. To conclude the proof, recall that the number of dimensions required to represent the same system in the linear PSR framework ($n$) is at most equal to the number of hidden states in the corresponding POMDP ($m$), and can be, at times significantly, smaller (Jaeger, 2000; Littman et al., 2001).

References

Faigle, U. and Schönhuth, A. Asymptotic mean stationarity of sources with finite evolution dimension. IEEE Transactions on Information Theory, 53(7):2342-2348, 2007.

Jaeger, H. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371-1398, 2000.

James, M.R. Using predictions for planning and modeling in stochastic environments. PhD thesis, The University of Michigan, 2005.

Kemeny, J.G. and Snell, J.L. Finite Markov Chains, volume 356. Van Nostrand, Princeton, NJ, 1960.

Littman, M.L., Sutton, R.S., and Singh, S. Predictive representations of state. Advances in Neural Information Processing Systems, 14:1555-1561, 2001.

Schweitzer, P.J. Perturbation theory and finite Markov chains. Journal of Applied Probability, pp. 401-413, 1968.

Singh, S., James, M.R., and Rudary, M.R. Predictive state representations: A new theory for modeling dynamical systems. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 512-519. AUAI Press, 2004.

Wiewiora, E. Learning predictive representations from a history. In Proceedings of the 22nd International Conference on Machine Learning, pp. 964-971. ACM, 2005.

Wiewiora, E.W. Modeling probability distributions with predictive state representations. PhD thesis, The University of California at San Diego, 2007.