Average Reward Optimization Objective In Partially Observable Domains - Supplementary Material

Similar documents
Math 61CM - Solutions to homework 3

5.1. The Rayleigh s quotient. Definition 49. Let A = A be a self-adjoint matrix. quotient is the function. R(x) = x,ax, for x = 0.

Apply change-of-basis formula to rewrite x as a linear combination of eigenvectors v j.

Stochastic Matrices in a Finite Field

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Eigenvalues and Eigenvectors

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11

Math Solutions to homework 6

1 Last time: similar and diagonalizable matrices

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Entropy Rates and Asymptotic Equipartition

MATH10212 Linear Algebra B Proof Problems

Achieving Stationary Distributions in Markov Chains. Monday, November 17, 2008 Rice University

Lecture 8: October 20, Applications of SVD: least squares approximation

Notes for Lecture 11

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.

Generalized Semi- Markov Processes (GSMP)

TENSOR PRODUCTS AND PARTIAL TRACES

7.1 Convergence of sequences of random variables

6.3 Testing Series With Positive Terms

Beurling Integers: Part 2

15.083J/6.859J Integer Optimization. Lecture 3: Methods to enhance formulations

Roger Apéry's proof that zeta(3) is irrational

LECTURE 8: ORTHOGONALITY (CHAPTER 5 IN THE BOOK)

A Hadamard-type lower bound for symmetric diagonally dominant positive matrices

5 Birkhoff s Ergodic Theorem

Week 5-6: The Binomial Coefficients

Integrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number

The multiplicative structure of finite field and a construction of LRC

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Assignment 5: Solutions

Singular value decomposition. Mathématiques appliquées (MATH0504-1) B. Dewals, Ch. Geuzaine

A Note on the Symmetric Powers of the Standard Representation of S n

Lecture 7: October 18, 2017

MAT1026 Calculus II Basic Convergence Tests for Series

ACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory

Sequences and Series of Functions

Optimally Sparse SVMs

Symmetric Matrices and Quadratic Forms

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

4 The Sperner property.

An Introduction to Randomized Algorithms

lim za n n = z lim a n n.

A 2nTH ORDER LINEAR DIFFERENCE EQUATION

Singular Continuous Measures by Michael Pejic 5/14/10

a for a 1 1 matrix. a b a b 2 2 matrix: We define det ad bc 3 3 matrix: We define a a a a a a a a a a a a a a a a a a

MATH 205 HOMEWORK #2 OFFICIAL SOLUTION. (f + g)(x) = f(x) + g(x) = f( x) g( x) = (f + g)( x)

b i u x i U a i j u x i u x j

Supplemental Material: Proofs

Riesz-Fischer Sequences and Lower Frame Bounds

ABOUT CHAOS AND SENSITIVITY IN TOPOLOGICAL DYNAMICS

Resolution Proofs of Generalized Pigeonhole Principles

Assignment 2 Solutions SOLUTION. ϕ 1 Â = 3 ϕ 1 4i ϕ 2. The other case can be dealt with in a similar way. { ϕ 2 Â} χ = { 4i ϕ 1 3 ϕ 2 } χ.

Complex Analysis Spring 2001 Homework I Solution

Analytic Continuation

Convergence of random variables. (telegram style notes) P.J.C. Spreij

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013

Zeros of Polynomials

The Discrete Fourier Transform

3.2 Properties of Division 3.3 Zeros of Polynomials 3.4 Complex and Rational Zeros of Polynomials

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

a for a 1 1 matrix. a b a b 2 2 matrix: We define det ad bc 3 3 matrix: We define a a a a a a a a a a a a a a a a a a

SOME TRIBONACCI IDENTITIES

Infinite Sequences and Series

The Perturbation Bound for the Perron Vector of a Transition Probability Tensor

Power series are analytic

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1

On the Linear Complexity of Feedback Registers

Notes The Incremental Motion Model:

5.1 Review of Singular Value Decomposition (SVD)

1+x 1 + α+x. x = 2(α x2 ) 1+x

(for homogeneous primes P ) defining global complex algebraic geometry. Definition: (a) A subset V CP n is algebraic if there is a homogeneous

6. Kalman filter implementation for linear algebraic equations. Karhunen-Loeve decomposition

subcaptionfont+=small,labelformat=parens,labelsep=space,skip=6pt,list=0,hypcap=0 subcaption ALGEBRAIC COMBINATORICS LECTURE 8 TUESDAY, 2/16/2016

Markov Decision Processes

CS284A: Representations and Algorithms in Molecular Biology

Linear Regression Demystified

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

A Proof of Birkhoff s Ergodic Theorem

Ma 530 Introduction to Power Series

On forward improvement iteration for stopping problems

(3) If you replace row i of A by its sum with a multiple of another row, then the determinant is unchanged! Expand across the i th row:

Problem Set 2 Solutions

The Jordan Normal Form: A General Approach to Solving Homogeneous Linear Systems. Mike Raugh. March 20, 2005

Additional Notes on Power Series

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem

The Growth of Functions. Theoretical Supplement

Chapter 6 Infinite Series

PROBLEM SET 5 SOLUTIONS 126 = , 37 = , 15 = , 7 = 7 1.

Advanced Stochastic Processes.

CHAPTER 5. Theory and Solution Using Matrix Techniques

Notes 27 : Brownian motion: path properties

Continuous Functions

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +

arxiv: v1 [math.fa] 3 Apr 2016

Math 451: Euclidean and Non-Euclidean Geometry MWF 3pm, Gasson 204 Homework 3 Solutions

Seunghee Ye Ma 8: Week 5 Oct 28

Measure and Measurable Functions

Orthogonal transformations

Transcription:

Average Reward Optimizatio Objective I Partially Observable Domais - Supplemetary Material Aother example of a cotrolled system (see Sec. 4.2) Figure 3. The first two plots describe the behavior of the system which rotates the state by a 10 ad b 10, with the reset states aliged i a way that takig the opposite actio brigs the system to its topmost poit. The x ad y i the plots represet possible actios such that x y. The bottom plot demostrates how the average reward chages as a fuctio of α ad β, where the policies havig 2 hidde states (S {1, 2}) are parametrized as: P π(a S = 1) = P π(b S = 2) = 1, P π(s = 1 S = 1, a, o 1) = P π(s = 2 S = 2, a, o 1) = α, P π(s = 2 S = 1, a, o 2) = P π(s = 1 S = 2, a, o 2) = α, P π(s = 1 S = 1, b, o 1) = P π(s = 2 S = 2, b, o 1) = β, P π(s = 2 S = 1, b, o 2) = P π(s = 1 S = 2, b, o 2) = β.

Proof of Lemma 1 1.: EE = E E = E is immediate from the defiitio of E. Therefore 1 Et ]E = E for ay, implyig that also E E = E. 2.: ( ] (E E ) = (E EE ) = E (I E ) = E ( 1) )(E i ) i i i=0 ( ( ) ) ] = E I + ( 1) i 1 E = E E, i i=0 where some of the equalities follow from property 1, ad the last equality holds sice the alteratig sum of biomial coefficiets equals to 0. 3.: Sice (E I) has rak 1, the eigespace of E correspodig to the eigevalue 1 is oe dimesioal. So we have a uique ρ such that ρ E = ρ ad ρ 1 = 1. Also, if Ev = v the v is a costat vector because the rows of E sum to 1. Hece also, ρ E = ρ ad E 1 = 1. Now, from property 1 we have E (E I) = 0, meaig that the rows of E are multiples of ρ. Also, we have (E I)E = 0, meaig that the colums of E are costat vectors. Combiig all together we get E = 1ρ. 4.: Observe that (E E ) is Cesaro summable to 0, i.e. 1 (E E ) t = 1 E t ] E 0, due to property 2. So, by Kemey & Sell (1960) (Thm. 1.11.1) ad property 2, 1 I (E E )] 1 = lim k=0 t (E E ) t = I + lim 1 t=1 k=1 t (E k E ). 5.: Follows from replacig Z with the right-had side of Eq. (2). Proof of Theorem 2 The proof is essetially idetical to the proof of Theorem 1 i Schweitzer (1968), which is, however, restricted to Markov chai trasitio matrices. We have, I (E 2 E 1 )Z 1 = (Z 1 1 E 2 + E 1 )Z 1 = (I E 2 + E 1 )Z 1 = (Z 1 2 + E 1 E 2 )Z 1 = Z 1 2 (I + E 1 E 2 )Z 1, where all (except for the last) iequalities follow from the defiitio of Z i. The last equality holds sice Z 1 2 E 1 = E 1 ad Z 1 2 E 2 = E 2, which i tur is due to Z 1 2 havig rows sum to 1, ad property 3 i Lemma 1. Now, observe that (E 2 E 1 ) 2 = 0, meaig that Theorem 1.11.1 i Kemey & Sell (1960) ca be applied o (E 2 E 1 ), ad we get I (E 2 E 1 )] 1 = I E 1 + E 2, which proves Eq. (3). Fially, ρ 1 H 1 2 = ρ 1 Z 1 1 I E 1 + E 2 ]Z 2 = ρ 1 E 2 Z 2 = ρ 2 Z 2 = ρ 2, where ( ) is due to property 3 i Lemma 1 ad ρ 1 1 = 1, while the last equality is due to property 5 i Lemma 1.

Proof of Theorem 3 Average Reward Optimizatio Objective - Supplemetary Material Recall the cocept of a system dyamics matrix (SDM) (Sigh et al., 2004), also kow as predictio matrix (Faigle & Schohuth, 2007). This is a ifiite dimesioal matrix, whose colums correspod to possible histories (ordered i a icreasig legth) ad rows correspod to possible tests (ordered i a icreasig legth). The elemets of this matrix represet coditioal probabilities of observig each test give each history (for more details see Sigh et al. (2004); Faigle & Schohuth (2007)). If the SDM has rak k the the system ca be represeted by a k-dimesioal liear PSR, ad the square submatrix of SDM of dimesio A O k will be of rak k (e.g. see James (2005)). Accordig to Wiewiora (2007), the system resultig from combiig a k-dimesioal liear PSR with cotrol ad a policy with memory of size l ca be represeted with a liear PSR without cotrol whose dimesio is at most k l. We will deote the distributio over future sequeces i such a system with P. Let be the largest rak of a SDM iduced by some policy, implyig that a miimal dimesioal PSR for this system has rak (Wiewiora, 2007). Let g (i) represet the expected state of the system after i time steps, i.e. g (i) is a ifiite-dimesioal vector such that g (i) of colums of SDM, we have that G spa{g (0) (t) = h A O P (t h)p i (h). Sice g (i) are liear combiatios, g(1),...} is at most dimesioal. Let m be the maximum dimesio of G for ay Θ. We will use the followig results from Faigle & Schohuth (2007), which cosider a stochastic process represeted by a OOM for a fixed Θ: 1. If G is m-dimesioal, {g (0),..., g(m 1) } is a basis for G. 2. the state evolutio operator ca be represeted by the matrix E R m m relative to this basis, such that E = 0 0... 0 c (0) 1 0... 0 c (1) 0 1... 0 c (2)...... 0 0... 1 c (m 1) where m 1 j=0 c(j) = 1, ad all c (j) -s are uique. E is the PSR evolutio matrix represeted uder a differet basis sice j N : g (j) = m 1 i=0 a(j) i g (i), where a(j) (a (j) 0,..., a(j) m 1 ) = (1, 0,..., 0) Ej. I other words, it outputs the ext expected state provided the curret state. 3. Furthermore, E lim k k Et exists. The remaiig of the proof cocers the costructio of c (i) -s as fuctios of ad showig that these are well defied ratioal fuctios of for all Θ. Oce this is doe, Theorem 2 ca be applied o two arbitrary matrices E 1 Θ, E 2 Θ, cocludig the proof. Due to the properties of SDM metioed above, we ca idetify ifiite-dimesioal vectors g (i) with their fiite dimesioal couterparts g (i) R A O, where t { A O i } i=1,..., : g (i) (t) = g(i) (t), such that dim spa{g (0) ], g(1),...} = dim, spa{g (0) ], g(1),...}, for all Θ. Note that the same matrix E represets the uique 1 evolutio matrix relative to {g (0),..., g(m 1) }. Let G R A O m be defied as ] G =,..., g(m 1). g (0), g(1) Note that is a direct parametrizatio of a policy, meaig that elemets of G are polyomials i. Thus, Propositio 1 is applicable to G, ad from there we obtai a well defied orthogoal basis {b (0),..., b(m 1) } for 1 As log as {g (0),..., g(m 1) } is liearly idepedet set of vectors.

the colum space of G, such that these basis vectors are ratioal fuctios of. Now we ca defie c (i) i the followig recursive way: ] ] i = m 1 : c (i) = g (m) b (m 1) b (m 1) 2 2 i < m 1 : c (i) = g (m) ] ]. i+1 j=m c(j) g(j) b (i) b (i) 2 2 It is easy to verify that ideed we get g (m) = m 1 i=0 c(i) g(i) for -s for which c (i) -s are well defied. We also kow that these coefficiets are uique as log as G has full rak. Sice these coefficiets are ratioal fuctios of, by the same argumet as i Propositio 1 we get that aroud ay potetial sigularity poit i c (i) we ca fid a sequece of -s covergig to the sigularity poit while c (i) is well defied o this sequece. However, we kow that for all -s where c (i) -s are well defied, they must be bouded. This is due to the fact that E exists for all such -s, implyig that the spectral radius of E is bouded (ad equals to 1). Hece, 0 i < m : c (i) do ot have sigularity poits, or i other words, are well defied for all Θ. Fially, we ote that every E satisfies rak(e I) = 1 by costructio, so the coditios of Theorem 2 for ay pair of these matrices are satisfied. The left eigevector correspodig to eigevalue 1 of E idetifies the statioary mea of the stochastic process S iduced by policy due to the followig: this vector represets the ivariat state of the system with respect to colum vectors i G, which are by themselves liear combiatios of ay fixed colum basis of SDM matrix. Recall that due to Theorem 2, the etries of this eigevector are ratioal fuctios of policy parameters. Therefore, we get that each etry of the statioary distributio of PSR is also a ratioal fuctio of policy parameters. To complete the proof, ote that the statioary distributio of PSR that we have obtaied coicides with the empirical distributio of sequeces observed from data sice the statioary distributio is uique, which it tur is due to the ergodicity assumptio. Propositio 1. Let G R m, m, be a ratioal matrix valued fuctio of Θ, such that G is bouded i Θ, ad Θ be a subspace of some fiite dimesioal Euclidea space. Assume that G has full rak for some Θ. Let {g (i) } 0 i< be the colums of G. The, {b (0),..., b( 1) } defied by i = 0 : i > 0 : b (i) b (i) = g (i) = g (i) i 1 j=0 g (i) ] 2 2 is a well defied orthogoal basis for the colum space of G for all Θ. Proof. Sice G is ratioal i, {b (i) } 0 i< are ratioal fuctios (elemet wise) i. For ay, i particular where the fuctio is ot defied/sigular 2, we ca fid a sequece 1, 2,... where this fuctio is well defied. Suppose that such a sigularity occurs to some elemet of b (i) 0 for some 0 Θ, ad without loss of geerality assume that 0 are well defied for 0 j < i. Sice j : g (j) -s are bouded, at least oe of the elemets of the followig terms (0 j < i) must approach sigularity as 0 : But 2 is bouded aroud 0, leadig to a cotradic- meaig that g (i) tio. g (i) ] 2 2 g (i) ] 2 2 ( = g (i) must have a sigularity poit at 0 sice ±. b(j) ] 2 ) 2,, 2 The fuctio f(x) is sigular at poits {x X limˆx x f(ˆx) = ± }. For ratioal fuctios, these are the roots of the polyomial i the deomiator, which do ot coicide with the roots of the polyomial i the umerator.

Proof of Theorem 4 Average Reward Optimizatio Objective - Supplemetary Material The proof of this theorem uses two results that are give i Propositios 2 ad 3 with their proofs attached. Propositio 2 provides a clear way of defiig parameters of a PSR without cotrol give a PSR (with cotrol) ad the parameters of a fiite memory policy. Propositio 3 shows that oe ca obtai the state of the PSR (with cotrol) from the state of the PSR without cotrol by applyig a liear operator. 1 First, we start by showig that lim R t exists almost surely for ay policy Θ. Recall that we do ot assume that the process is AMS (otherwise there is othig to prove), oly that the process is ergodic. Let R k = k R t, { limk R f( ) k if the limit exists R mi 1 otherwise, { Rk if f( ) R f k ( ) mi R mi 1 otherwise, where R mi is the lowest achievable reward (recall that the reward is bouded). From the defiitio we have that {f k } k f poitwise everywhere. Sice {f k } are uiformly bouded, lim E (f k ) = E (f). k Fially, ote that the set {f( ) = y} is ivariat for ay y because f is ivariat: f(tx) = f(x). By the defiitio of ergodicity, such a set has either probability 1 or 0. Further we show that lim k E (R k ) is well defied, which esures that E (f) = E (lim k R k ) R mi 3, f = E (f) almost surely. Let {w,, W,ao, p,0 } be the miimal PSR without cotrol for some policy, ad let W, ao W,ao be the PSR evolutio matrix. Due to the properties of this matrix discussed earlier we have that W lim W t, exists. 3 Let B idicate the set for which lim k R k exists. The lim k E (R k ) = lim k R B kdp + ] R B k dp = lim k R B kdp + lim k B R kdp = lim k E (R k B)P (B) + lim k B R kdp. Sice both the left had side limit ad the first right had side limit are well defied, the the secod right had side limit must exist as well, implyig that P ( B) = 0.

Therefore, lim k E ( k ) R t = lim 1 = lim E (R t ) = lim t=1 = lim 1 lim r a V t=1 E (R t ) (1) h (t) E(R t h (t) )P ( h (t) ) (2) t=1 h (t 1),a,o t=1 h (t 1),o lim r a p( h (t 1), a, o)p ( h (t 1), a, o) (3) r a V p ( h (t 1), a, o)p ( h (t 1), a, o) (4) t=1 h (t 1),o p ( h (t 1), a, o)p ( h (t 1), a, o) ] r a V lim (p,0w, W t,a ) (5) t=1 ] ) r a V (p,0 lim W, t W,a t=1 r ( a V p,0 W ) W,a = r a V (ρ W,a ) (6) a r W,a a V ρ w, W,a ρ w, W,a ρ r a V ρ (a) P (a) (7) r a V ρ (a) a ρ P ol r a ρ P RP (a) a ρ P ol (8) where (3) follows from the defiitio of the liear PRP; (4) ad (8) are due to Propositio 3 with V beig the liear operator trasformig the state of the PSR without cotrol to the state of the PSR (with cotrol); (5) is by defiitio of a liear PSR; (6) is due to the properties of the PSR evolutio matrix W,, where ρ is the statioary distributio of the PSR without cotrol iduced by policy ; ad ρ (a) i (7) ad (8) represets the state of the PSR without cotrol obtaied by startig from ρ ad takig actio a, while P (a) is the average probability of takig actio a. Fially, ote that ρ ad ρ (a) are ratioal fuctios of Θ by Theorem 3. Hece both ρ P ol ad ρ P RP (a) are ratioal fuctios of as well sice they ca be immediately obtaied from ρ ad ρ (a) through summatio ad divisio (i.e., observe that the PSR state ρ P RP is a vector of coditioal predictios while the PSR without cotrol state ρ is a vector of joit predictios). Propositio 2. Let {m, {M ao } ao A O, p 0, Q = {q 1,..., q }} be a -dimesioal liear PSR (with cotrol) ad {c 0, {T o } o O, {A a } a A } be the represetatio of a stochastic fiite state policy π of size l. Let s 0 p 0 c 0 R l, b m 1 R l, ao A O : B ao M ao A a T o R l l, where represets the Kroecker product. The, the triple {b, {B ao } ao A O, s 0 } satisfies the properties of a PSR without cotrol iduced by the policy π. I particular, P π (a 1 o 1,..., a k o k ) = s 0 B a1o 1 B ak o k b.

Proof. For clarity, here is the descriptio of the meaig of the stochastic fiite state policy parameters: c 0 - the iitial distributio over the states of the policy A a - the diagoal matrix with A a ] ii beig equal to probability of takig actio a i state i T o - the trasitio matrix defiig the state dyamics of the policy correspodig to observig percept o Let C ao A a T o. Recall that for ay history h, P π (o 1:k h, a 1:k ) = p(h) M a1o 1 M ak o k m = p(h) M ao1:k m, P π (a 1:k h, o 1: ) = c(h) A a1 T o1 A ak 1 c(h) C ao1: A ak 1, where 1 is a colum vector of oes of a appropriate size. Observe that By iductio (similarly to Wiewiora (2005)) we get, P π (ao) = P π (a)p π (o a) = c 0 C ao 1 p 0 M ao m. P π (ao 1:k, a k+1, o k+1 ) = P π (ao 1:k ) P π (a k+1 ao 1:k ) P π (o k+1 ao 1:k, a k+1 ) = c 0 C ao1:k 1 p 0 M ao1:k m c(ao 1:k ) A ak+1 1 p(ao 1:k ) M ak+1 o k+1 m = c 0 C ao1:k 1 p 0 M ao1:k m c 0 C ao1:k c 0 C ao 1:k 1 A a k+1 1 p 0 M ao1:k p 0 M M ak+1 o ao 1:k m k+1 m = c 0 C ao1:k A ak+1 T ok+1 1 p 0 M ao1:k+1 m = p 0 M ao1:k+1 m c 0 C ao1:k+1 1, where we used the fact that C ao1:k 1 = C ao1: A ak 1 sice every T o is a trasitio matrix. Now, the first property holds sice s 0 b = p 0 c 0 ] m 1] = p 0 m c 0 1 = 1. The secod property also holds because ] ] B ao b = M ao C ao m 1] = M ao m C ao 1] ao ao ao = M ao m A a 1] = ( ) ] M ao m A a 1 ao a o = ] m A a 1] = m A a 1 = m 1 = b. a a Fially, the last property is due to P π (ao 1:k ) = p 0 M ao1:k m c 0 C ao1:k 1 = p 0 c 0 ] M ao1:k C ao1:k ]m 1] = s 0 B ao1:k b. Propositio 3. Let the actio-observatio stochastic process be geerated from a liear -dimesioal PSR (with cotrol), p, ad a stochastic fiite state cotroller of size m parametrized by Θ. Let p be a miimal liear PSR without cotrol represetig this process. The, the state of PSR (with cotrol), p, ca be obtaied from the state of PSR without cotrol p by applyig a liear operator.

Proof. We will refer to p as both the PSR (with cotrol) ad the state of this PSR (which oe is meat should be clear from the cotext). Equivaletly, we will refer to p as both the PSR without cotrol ad the state of this PSR. First, let {b, B ao, s 0 } be defied as i Propositio 2 from p ad the parameters of the policy represeted as {c 0, {T o } o O, {A a } a A }. Let U = I 1 R m, where I is the -dimesioal idetity matrix. Let s(h) p(h) c(h) be the state of this PSR-like costruct after observig history h. The, Us(h) = I 1 ]s(h) = I 1 ]p(h) c(h)] = p(h) 1 c(h)] = p(h). Therefore we ca recover the state of the eviromet p from the state of our costruct by applyig a liear operator. What is missig yet, is the coectio betwee this costruct ad the PSR without cotrol p. Recall that p is a vector of predictios of some sequeces of observatios give some sequeces of actios. So it is clear that we ca compute p from the joit distributio over actio observatio pairs provided by the state of p, which i geeral is a ratioal fuctio of p. Let V ( ) : p p be such a mappig. Let the dimesio of p be k, ad W be m k matrix whose colums are coefficiets of k core tests with respect to the costruct {b, B ao, s 0 }: s(h) W = p (h) for ay h. Now, we have that for ay h: Us(h) = p(h) = V (s(h) W ) = V (p (h)). From here the coclusio follows. We ca represet the operator V with matrix V R k. V ca be obtaied by solvig the followig system of equatios: V = PP 1, where P R k k represets a collectio (matrix) of k liearly idepedet states (those correspodig to core histories) ad P R k represets a collectio of eviromet states correspodig to the same histories. Proof of Corollary 5 I both represetatios the average reward fuctio is liear i the statioary distributio as a fuctio of the policy parameters, hece the complexity of the statioary distributio gives the upper boud o the complexity of the average reward. Let k be the size of the stochastic fiite state cotrollers we cosider. We aalyze the complexity with respect to the POMDP represetatio first. Observe that the HMM represetig the combied system ad the policy will have km hidde states. Sice the Markov chai defied over the hidde states iduced by ay of the policies of iterest is irreducible due to the ergodicity assumptio, the correspodig trasitio matrix satisfies the coditios of Theorem 1 from Schweitzer (1968), which is a special case of Theorem 2. Each etry of this trasitio matrix is a polyomial of a fixed degree i the parameters of the policy. So are the etries of H 1 1 2 i Theorem 2 whe we fix E 1, Z 1, ρ 1 ad let E 2 be our parametrized trasitio matrix. Oe ca aalytically ivert H 1 1 2 usig Cramer s rule ad observe that each etry of H 1 2 is a ratioal fuctio whose degree is O(km). Therefore, by Theorem 2 the degree of the statioary belief state ρ 2 is also O(km). Now we cosider the reward process beig represeted by a liear PRP. Observe that the liear PSR without cotrol represetig the actio observatio stochastic process iduced by the policy above is at most k dimesioal. Each etry of the SDM describig this actio observatio stochastic process is a ratioal fuctio with the leadig degree equal to the legth of the history plus test. Therefore, the colum vectors stored i G i the proof of Theorem 3 are of degree at most k. The elemets of the basis costructed from the colums of G are also of degree at most O(k) (see Propositio 1), as well as the etries of the bottom row of the evolutio matrix E costructed i Theorem 3. Similarly to a POMDP case, we eed to apply Theorem 2 for some fixed E 1, Z 1, ρ 1, where E 2 E is the fuctio of the policy parameters. Observe that from Theorem 2 we also have H 1 2 = Z 1 1 (Z 1 1 + E 1 E 2 ) 1, where the resultig matrix Z 1 1 + E 1 E 2 has fixed etries i all the rows except for the bottom oe, which is a ratioal fuctio of degree at most O(k). Agai, usig Cramer s rule oe ca ivert the matrix aalytically, which results i H 1 2 havig etries beig ratioal fuctios of degree at most O(k). The, by Theorem 2 the degree of the statioary distributio of the PSR without cotrol, ρ ρ 2, is also at most O(k). Fially, oe ca verify that the quatities ρ P RP (a) ad ρ P ol appearig i Theorem 4 will also have the leadig degree of the

same order. To coclude the proof, recall that the umber of dimesios required to represet the same system i a liear PSR framework () is at most equal to the umber of hidde states i the correspodig POMDP (m), ad ca be, at times sigificatly, smaller (Jaeger, 2000; Littma et al., 2001). Refereces Faigle, U. ad Schohuth, A. Asymptotic mea statioarity of sources with fiite evolutio dimesio. Iformatio Theory, IEEE Trasactios o, 53(7):2342 2348, 2007. Jaeger, H. Observable operator models for discrete stochastic time series. Neural Computatio, 12(6):1371 1398, 2000. James, M.R. Usig predictios for plaig ad modelig i stochastic eviromets. PhD thesis, The Uiversity of Michiga, 2005. Kemey, J.G. ad Sell, J.L. Fiite Markov chais, volume 356. va Nostrad Priceto, NJ, 1960. Littma, M.L., Sutto, R.S., ad Sigh, S. Predictive represetatios of state. Advaces i eural iformatio processig systems, 14:1555 1561, 2001. Schweitzer, P.J. Perturbatio theory ad fiite Markov chais. Joural of Applied Probability, pp. 401 413, 1968. Sigh, S., James, M.R., ad Rudary, M.R. Predictive state represetatios: A ew theory for modelig dyamical systems. I Proceedigs of the 20th coferece o ucertaity i artificial itelligece, pp. 512 519. AUAI Press, 2004. Wiewiora, E. Learig predictive represetatios from a history. I Proceedigs of the 22d iteratioal coferece o machie learig, pp. 964 971. ACM, 2005. Wiewiora, E.W. Modelig probability distributios with predictive state represetatios. PhD thesis, The Uiversity of Califoria at Sa Diego, 2007.