Average Reward Optimization Objective In Partially Observable Domains

Yuri Grinberg, School of Computer Science, McGill University, Canada
Doina Precup, School of Computer Science, McGill University, Canada

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

We consider the problem of average reward optimization in domains with partial observability, within the modeling framework of linear predictive state representations (PSRs) (Littman et al., 2001). The key to average-reward computation is to have a well-defined stationary behavior of the system, so that the required averages can be computed. If, additionally, the stationary behavior varies smoothly with changes in the policy parameters, average-reward control through policy search also becomes a possibility. In this paper, we show that PSRs have a well-behaved stationary distribution, which is a rational function of the policy parameters. Based on this result, we define a related reward process particularly suitable for average reward optimization, and analyze its properties. We show that in such a predictive state reward process, the average reward is a rational function of the policy parameters, whose complexity depends on the dimension of the underlying linear PSR. This result suggests that average-reward-based policy search methods can be effective when the dimension of the system is small, even when the system representation in the POMDP framework requires many hidden states. We provide illustrative examples of this type.

1. Introduction

Partial observability is prevalent in practical domains, in which one has to model systems based on noisy observations. In domains with partial observability and finite action and observation spaces, a popular framework for modeling and control is that of finite state partially observable Markov decision processes (POMDPs) (Kaelbling et al., 1998). POMDPs generalize the well-known framework of Markov decision processes (MDPs); hence, they inherit many useful properties from MDPs. In particular, the results of Schweitzer (1968) imply that the stationary distribution of an ergodic Markov chain is a rational function of the changes in the transition matrix. Under mild conditions, this result can be applied immediately to finite state POMDPs: the stationary distribution of the Markov chain induced by a finite memory policy exists and is a rational function of the policy parameters. This is important when one has to estimate expectations of different quantities (such as returns) with respect to the policy, since these expectations are taken with respect to the stationary behavior of the process. In particular, one can show that the average reward in finite state POMDPs is a linear function of the stationary distribution. Thus, knowing that the stationary distribution is robust to small changes in the policy implies that estimates of different quantities based on the stationary distribution are robust as well.

About a decade ago, a new modeling framework for finite action and observation spaces was introduced, named (linear) predictive state representations (PSRs) (Littman et al., 2001; Singh et al., 2004). PSRs alleviate model learning challenges encountered in finite state POMDP models, because their state representation is grounded in measurable quantities. More importantly, PSRs are capable of representing some POMDPs with a smaller dimension than the number of hidden states.
Moreover, there are systems which can be represented by a linear PSR but not by a finite POMDP (Jaeger, 2000).

PSRs are formulated in terms of actions and observations, but rewards do not play as special a role in the original framework as they do in MDPs and POMDPs. The planning problem using linear PSRs has already been tackled by several authors, e.g., James et al. (2004); Boots et al. (2010); Izadi & Precup (2003), but little was said about how their specific assumptions on the reward function affect the combined reward-PSR process. The main motivation for our work is to provide a theoretical framework for reward processes based on PSRs, and to analyze the behavior of the average reward in this framework. There are two practical implications of this idea. First, it would enable the development of learning and planning algorithms which are directly based on evaluating reward averages, rather than learning a full predictive representation first and then using it to estimate policy values. In the MDP and POMDP frameworks, methods based on value functions and/or policy search are known to scale better to very large problems than methods based on full model estimation. We would like to bring this advantage to linear PSRs as well. Second, defining an appropriate reward process based on PSRs enables an easier theoretical analysis of any control method. In Section 4, we present a (linear) predictive state reward process (PRP), which is built on top of a (linear) PSR and provides a suitable framework for average reward optimization. We analyze some of its properties and discuss its relationship to the POMDP framework. Our most important result connects the size of the linear PSR representation underlying the linear PRP to the complexity of the average reward as a function of policy parameters. We prove that a compact PSR representation will induce a reward process with a simple average reward function. We provide some examples in which the linear PRP representation is much smaller than the corresponding POMDP representation; hence, using the PRP for policy search, for example, could potentially be much easier and yield a solution faster.

The key ingredient in analyzing the PRP is to analyze the behavior of its stationary distribution as a function of the policy used, similar to the case of a POMDP. This stationary distribution is in fact the stationary distribution of the underlying linear PSR, which has been shown to exist (Faigle & Schonhuth, 2007) given a fixed policy. However, its form has not been stated before. In Section 3, we show that the stationary distribution induced by a finite memory policy in a linear PSR is a rational function of the policy parameters, as long as the process is ergodic, similar to the finite state POMDP framework. This result is a novel contribution in itself, and can be useful if a stability/perturbation analysis of a linear PSR is desired. Throughout the paper, we omit the proofs for brevity; they are included in the appendix provided as supplementary material.

2. Background and notation

We consider systems with control having finite action and observation/percept spaces, denoted A and O respectively, and rewards coming from a bounded set R ⊂ ℝ. We define a reward process as a tuple (Ω, F, T, μ_0), where (Ω, F) is a measurable space, μ_0 is a probability kernel representing the effect of actions, and T is a shift operator. In our setting, the elements of Ω can be viewed as infinite sequences of action-observation-reward triples,

ω ∈ Ω : ω = ⟨a_1, o_1, r_1, a_2, o_2, r_2, ...⟩,  a_i ∈ A, o_i ∈ O, r_i ∈ R.

In this setting, T is defined by

T(⟨a_1, o_1, r_1, a_2, o_2, r_2, ...⟩) = ⟨a_2, o_2, r_2, a_3, o_3, r_3, ...⟩,  T^{-1}(ω) = {ω' : T(ω') = ω}.

At each time step, the reward process waits for the agent to take an action and outputs an observation-reward pair. We are interested in the average reward setting, where the goal is to find a behavior (or policy) π maximizing

lim_{n→∞} (1/n) Σ_{t=1}^{n} R_t,

where R_t is the random variable denoting the reward observed at time t, when following policy π. The average reward setting is in fact preferable to discounted rewards in partially observable systems when the dynamics are not known¹ (Singh et al., 1994). We consider the classical policy definition, in which the action choice depends (stochastically) only on actions and observations, but not on rewards: ∀k ∈ ℕ, π : (A × O)^k → [0, 1]^{|A|}. Representing an arbitrary policy of this kind is infeasible, since it requires infinite memory. Therefore, only policies having a finite memory representation are considered. Such policies can be implemented through (stochastic) finite state controllers, whose parameters represent the transition probabilities between the internal states of the policy and the probabilities of actions conditioned on the policy state. We call this a direct policy parametrization. In particular, if the policy is memoryless and open-loop (i.e., it chooses actions with a fixed distribution regardless of the time step and history), its direct parametrization is simply the vector representing the distribution over actions.

¹ See also Sutton (1998), last bullet, for details; this addresses function approximation in MDPs, which is essentially equivalent to POMDPs.
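As an illustration of the finite state controllers and the direct policy parametrization just described, here is a minimal Python sketch; the class name, array layout, and numbers are our own choices for exposition and are not part of the paper.

```python
import numpy as np

class FiniteStateController:
    """Illustrative stochastic finite state controller (direct parametrization)."""

    def __init__(self, action_probs, transition_probs, start_state=0, seed=0):
        # action_probs[s, a]            = P(take action a | internal state s)
        # transition_probs[s, a, o, s'] = P(next internal state s' | s, a, o)
        self.action_probs = np.asarray(action_probs, dtype=float)
        self.transition_probs = np.asarray(transition_probs, dtype=float)
        self.state = start_state
        self.rng = np.random.default_rng(seed)

    def act(self):
        # Sample an action from the current internal state's action distribution.
        return self.rng.choice(self.action_probs.shape[1],
                               p=self.action_probs[self.state])

    def observe(self, action, observation):
        # Stochastically move to the next internal state given (action, observation).
        probs = self.transition_probs[self.state, action, observation]
        self.state = self.rng.choice(probs.shape[0], p=probs)

# The memoryless open-loop special case: a single internal state, so the direct
# parametrization reduces to one fixed distribution over (here, two) actions.
open_loop = FiniteStateController(action_probs=[[0.3, 0.7]],
                                  transition_probs=np.ones((1, 2, 2, 1)))
```

Calling open_loop.act() repeatedly draws i.i.d. actions; a controller with several internal states implements a finite memory policy in the sense above.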

Fixing a policy π generates a stochastic process whose elements are action-observation-reward triples. We denote the distribution of this process by P_π or P_θ, where θ is the parametrization of π, and the expectation with respect to P_π by E_π or E_θ. P_π is asymptotically mean stationary (AMS) (Gray & Kieffer, 1980) if

∀F ∈ F : P̄_π(F) = lim_{n→∞} (1/n) Σ_{i=0}^{n-1} P_π(T^{-i} F)

exists; P̄_π is called the stationary mean; let Ē_π be the expectation with respect to P̄_π. The AMS property is a necessary and sufficient condition for the running averages of any bounded quantity to converge; in particular, it guarantees the existence of the average reward. For example, if the reward process is a finite state Markov decision process (MDP), the stationary mean is the stationary distribution of the Markov chain induced by policy π. Moreover, P_π is ergodic if ∀G ∈ G : P̄_π(G) ∈ {0, 1}, where G ⊆ F is the set of invariant events (T^{-1}G = G). Thus, if P_π is AMS and ergodic,

lim_{n→∞} (1/n) Σ_{t=1}^{n} R_t = Ē_π(R)  a.s. (P_π, P̄_π),

meaning that the average reward is constant almost surely (for more details see Gray (2009)). In a finite state MDP, the ergodicity of the process corresponds to the Markov chain induced by π being irreducible.

2.1. Predictive State Representations

The idea of a predictive state of a system is rooted in the following observation: knowing the distribution of any finite sequence of observations given any sequence of actions describes completely and accurately the future dynamics of that system. In the setting we consider, the action and observation spaces are finite, meaning that one can represent the predictive state of the system with a countably infinite vector enumerating the probabilities of finite sequences of observations given sequences of actions. We will refer to this vector as the state of the system. In general, it might not be possible to represent this vector concisely. In Littman et al. (2001), a new representation for the action-observation controllable process (defined without rewards) was proposed, called a predictive state representation (PSR). It captures processes in which predictions of a finite number of sequences are enough to compute the prediction of any other sequence. Formally, let h, t ∈ (A × O)* be sequences of action-observation pairs of finite length, where h is a history, i.e., a sequence observed until now, and t is a sequence that might occur in the future, called a test. Let Q = {q_1, ..., q_n} be a set of tests, called core tests, and let

p(h) ≡ [P_psr(q_1 | h), ..., P_psr(q_n | h)]

be the prediction vector, where P_psr(q | h) ≡ μ_0(hq) / μ_0(h) represents the probability of observing the percepts from q given that the agent has seen history h and is going to take the actions from q. Then, p(h) is a predictive state representation of dimension n if and only if

P_psr(t | h) = f_t(p(h)),   (1)

for any test t and history h, where f_t : [0, 1]^n → [0, 1] is any function independent of history. If f_t is linear for all t, then the representation is called a linear PSR, and P_psr(t | h) = m_t^T p(h), where the function f_t is replaced by a linear projection m_t. It has been shown (Singh et al., 2004; Wiewiora, 2007) that a linear PSR possesses a finite parametrization of the form {m ∈ ℝ^n, {M_ao ∈ ℝ^{n×n}}_{a∈A, o∈O}, p_0 ∈ ℝ^n}, satisfying the following properties:

1. m^T p_0 = 1,
2. ∀a ∈ A : [Σ_{o∈O} M_ao]^T m = m,
3. P_psr(a_1, o_1, ..., a_k, o_k) = m^T M_{a_k o_k} ⋯ M_{a_1 o_1} p_0,

where p_0 is the starting state of the linear PSR.
Using property 3, one can verify that

p(hao) = M_ao p(h) / (m^T M_ao p(h))

represents the PSR state after observing ao, given the previous PSR state p(h). Predictions of future sequences can be calculated using the new PSR state and the rest of the linear PSR parameters only. For better readability, from now on we drop the prefix "linear" from "linear PSR" and explicitly state "non-linear" when referring to the general PSR framework.
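To make properties 1-3 and the state-update rule concrete, the following sketch computes sequence probabilities and updates the prediction vector; here M is assumed to be a dictionary mapping an (action, observation) pair to its n x n operator, a layout we choose for illustration rather than anything prescribed by the paper.

```python
import numpy as np

def sequence_probability(m, M, p0, sequence):
    """Property 3: P(a_1, o_1, ..., a_k, o_k) = m^T M_{a_k o_k} ... M_{a_1 o_1} p_0."""
    p = np.asarray(p0, dtype=float)
    for (a, o) in sequence:          # apply the operators in temporal order
        p = M[(a, o)] @ p
    return float(m @ p)

def update_state(m, M, p, a, o):
    """State update: p(hao) = M_{ao} p(h) / (m^T M_{ao} p(h))."""
    v = M[(a, o)] @ p
    return v / float(m @ v)
```

These two routines are all that is needed to roll a linear PSR forward under any policy.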

Any finite memory policy π in a PSR induces an action-observation stochastic process, which can also be represented by a PSR, possibly of a larger dimension (Wiewiora, 2007). This process is often called a PSR without control. In PSRs without control, property 2 reduces to [Σ_{ao ∈ A×O} M_ao]^T m = m, and in property 3, P_psr is replaced by P_π, where π is the policy at hand. A PSR without control can be represented in at least two more frameworks: observable operator models (OOMs) (Jaeger, 2000) and transformed PSRs (TPSRs) (Boots et al., 2010; Rosencrantz et al., 2004). These frameworks can be seen as different parametrizations of the same process, keeping the same dimension to represent its state (Singh et al., 2004; Boots et al., 2010; James, 2005). Therefore, one can use any of these frameworks equivalently to analyze properties of the system. In particular, our analysis uses results obtained under the OOM framework (Faigle & Schonhuth, 2007). Finally, the key property of these frameworks is that certain systems can be represented more compactly than in the classical finite state hidden Markov model (HMM) representation (Jaeger, 2000). Moreover, some of these systems might not be representable using finite-state HMMs at all. This is the reason why we focus on this type of representation.

3. The stationary distribution of a PSR

In this section, we investigate how the stationary distribution of a PSR behaves as the policy changes, where by stationary distribution we mean the stationary state of the system represented by a PSR. For this purpose, we focus on the stationary distribution of a PSR without control induced by some policy, and see how changes in this policy affect this distribution. Let ρ_π be the stationary distribution of a PSR without control induced by policy π.² For t ∈ ℕ, let p_t be the expected state of the PSR at time t. Assuming that p_0 = ρ_π, we have, by definition, ∀t ∈ ℕ : p_t = ρ_π. Now, let

M ≡ Σ_{ao ∈ A×O} M_ao

be the PSR evolution matrix. It is called the evolution matrix due to the following property, which can easily be seen from property 3: ∀t ∈ ℕ : p_t = M p_{t-1}. The matrix M will in fact define the stationary distribution of the PSR. Under the appropriate basis, it satisfies the conditions of Lemma 1 (Faigle & Schonhuth, 2007), and it can be seen as a generalized Markov transition matrix. More precisely, its spectral properties are the same as those of a usual Markov transition matrix, and its rows sum to 1, but some of the values might be negative (see Faigle & Schonhuth (2007), Section IV). The following lemma shows that several important properties of Markov transition matrices generalize to evolution matrices as well (represented under the appropriate basis).

² Note that ρ_π represents P̄_π, the stationary mean of the action-observation stochastic process.

Lemma 1. Let E ∈ ℝ^{n×n} be such that

E* ≡ lim_{n→∞} (1/n) Σ_{t=0}^{n-1} E^t

exists, and the rows of E sum to 1. Also assume that (E − I) has rank n − 1, where I is the identity matrix of appropriate size. Then the following holds:

1. E E* = E* E = E* E* = E*.
2. (E − E*)^n = E^n − E*.
3. E* = 1 ρ^T, where 1 is a column vector of ones and ρ is the unique column vector satisfying ρ^T E = ρ^T, ρ^T 1 = 1.
4. Let Z ≡ [I − (E − E*)]^{-1}; then Z is well defined and

   Z = I + lim_{n→∞} (1/n) Σ_{t=1}^{n-1} Σ_{k=1}^{t} (E^k − E*).   (2)

5. ρ^T Z = ρ^T.

The next theorem extends Theorem 1 of Schweitzer (1968) to the case in which the two matrices at hand are evolution matrices.

Theorem 2. Let E_1, E_2 be as in Lemma 1, and let E_1*, E_2*, ρ_1, ρ_2, Z_1, Z_2 be the quantities defined as in Lemma 1 with respect to E_1 and E_2, respectively. Then

H_{1→2} ≡ [I − (E_2 − E_1) Z_1]^{-1}   (3)

exists and is given by H_{1→2} = Z_1^{-1} [I − E_1* + E_2*] Z_2. Moreover, ρ_1^T H_{1→2} = ρ_2^T.

Theorem 2 lies at the heart of our work, since it establishes an exact relationship between the eigenvectors corresponding to eigenvalue 1 of two evolution matrices. Similarly to the theory of Markov chains, these eigenvectors represent the stationary distributions of the corresponding PSRs. Specifically, let E_1, Z_1 and ρ_1 be fixed, while E_2 varies.
Theorem 2 shows that ρ_2 is a rational function of the entries of E_2. This is due to the fact that the matrix inverse can be calculated analytically through Cramer's rule.
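As a numerical illustration of Lemma 1 and Theorem 2, the sketch below uses two ordinary row-stochastic matrices, which are a special case of evolution matrices; the particular matrices are arbitrary and not taken from the paper. It builds E*, Z and the map H_{1→2} of Eq. (3), and checks that ρ_1^T H_{1→2} recovers ρ_2.

```python
import numpy as np

def stationary_row_vector(E):
    # rho^T E = rho^T, rho^T 1 = 1: left eigenvector of E for eigenvalue 1.
    w, v = np.linalg.eig(E.T)
    rho = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return rho / rho.sum()

E1 = np.array([[0.9, 0.1, 0.0],
               [0.2, 0.6, 0.2],
               [0.1, 0.3, 0.6]])
E2 = np.array([[0.8, 0.1, 0.1],
               [0.3, 0.5, 0.2],
               [0.2, 0.2, 0.6]])

rho1, rho2 = stationary_row_vector(E1), stationary_row_vector(E2)
I = np.eye(3)
E1_star = np.outer(np.ones(3), rho1)            # E* = 1 rho^T (Lemma 1, item 3)
Z1 = np.linalg.inv(I - (E1 - E1_star))          # Z = [I - (E - E*)]^{-1}
H12 = np.linalg.inv(I - (E2 - E1) @ Z1)         # Theorem 2, Eq. (3)

print(np.allclose(rho1 @ H12, rho2))            # True: rho_1^T H_{1->2} = rho_2^T
```

By the lemma, the same computation applies when E_1 and E_2 are evolution matrices with some negative entries, which is the case relevant to PSRs.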

The next theorem is the main result of this section.

Theorem 3. Let an action-observation controllable process S be representable by a finite dimensional linear PSR, and let θ ∈ Θ be a direct parametrization of a stochastic finite memory policy. If the stochastic process generated from S induced by any policy θ ∈ Θ is ergodic, then the stationary state of S is a rational function of θ.

The proof (included in the supplementary material) shows how to construct the parametrized evolution matrix in an appropriate basis, given a PSR without control parametrized by a policy. The construction guarantees that the entries of the evolution matrix are rational functions of the policy parameters. We show that this construction is well defined, and then invoke Theorem 2 to complete the proof.

As mentioned, Theorem 3 requires the ergodicity condition for any policy under consideration. A violation of this condition implies the existence of a policy for which the initial state of the system affects its asymptotic behavior. In the theory of Markov chains, for example, this translates into the irreducibility condition on the chains induced by any policy from the set of policies considered. In particular, the irreducibility condition is either satisfied for all strictly stochastic policies or not satisfied for any policy. Hence, this assumption is considered standard in the policy search literature (Baxter & Bartlett, 2001; Peters & Schaal, 2008).

4. Predictive-state reward processes

In this section we define a (not necessarily linear) PSR-based reward process. The goal is to construct a reward process that inherits some of the useful properties of the PSR framework on the one hand, while being general enough and suitable for average reward optimization on the other hand. In particular, we let the reward be continuous. Although we keep the definition quite general, we are interested in systems that are based on the linear PSR framework, and only this setting is treated afterwards. Let h^{(t)} represent a sequence of action-observation-reward triples of length t, and let h̄^{(t)} represent the corresponding sequence of action-observation pairs only.

Definition 1. A predictive state reward process (PRP) is a reward process whose action-observation component can be represented by a (possibly non-linear) PSR with some state vector s ∈ ℝ^n, and whose reward R_t at time t satisfies, for all t ∈ ℕ,

E[R_t | h^{(t-1)}, a_t, o_t] = E[R | s(h̄^{(t)}), a_t] ≡ f_{R,a_t}[s(h̄^{(t)})],

where s(h̄^{(t)}) is the PSR state after observing h̄^{(t)}, and {f_{R,a}}_{a∈A} are functions independent of h̄^{(t)}. A linear PRP is a PRP whose action-observation component can be represented by a linear PSR, and in which the f_{R,a_t} are linear functions:

∀a_t ∈ A : f_{R,a_t}[s(h̄^{(t)})] = r_{a_t}^T s(h̄^{(t)}),  where ∀a ∈ A : r_a ∈ ℝ^n.

The dimension of the PRP is defined to be the dimension of the underlying PSR.

Note that Definition 1 in fact formalizes the setting under which most of the planning problem in linear PSRs has been tackled. For example, in Boots et al. (2010) the reward function is obtained by applying linear regression from PSR states to the observed rewards. James et al. (2004) assumes that the rewards are discrete and incorporates them directly into the observation vector. Izadi & Precup (2003) defines the reward signal itself to be a linear function of the state of a linear PSR. Thus, current PSR-based planning algorithms either implicitly assume that the underlying reward process is a linear PRP, or have more restrictive explicit assumptions.

The rest of this section is devoted to the analysis of the behavior of the average reward as a function of the policy in systems represented by a linear PRP. Two questions arise in this context. First, it is not clear whether the average reward is well defined for any finite memory policy. It has been shown that systems represented by a linear PSR are AMS for any finite memory policy (Grinberg & Precup, 2012).
This fact guarantees that averages of different observable quantities derived from action-observation pairs are well defined, but not necessarily the average reward itself. Second, when the average reward is well defined, we want to characterize its behavior as a function of the policy parameters. Both questions are addressed in Theorem 4: we show that under mild conditions the average reward exists and is linear in a quantity derived from the stationary distribution of the underlying PSR.

Theorem 4. Let an n-dimensional linear PRP be ergodic for a collection of stochastic finite state policies of size m given by a direct parametrization θ ∈ Θ. Let ρ_θ be the stationary state of the reward process without control induced by policy θ. Denote by ρ_θ^{PRP} ∈ ℝ^n and ρ_θ^{pol} ∈ ℝ^m the stationary states of the PRP and of the policy, respectively, obtained from ρ_θ. Also, for a ∈ A, let ρ_θ^{PRP}(a) ∈ ℝ^n be the PRP state obtained from starting the PRP at state ρ_θ^{PRP} and taking action a. Then, the average reward is a rational function of the policy parameters, given by

lim_{k→∞} (1/k) Σ_{t=0}^{k-1} R_t = Σ_{a∈A} [r_a^T ρ_θ^{PRP}(a)] [θ_a^T ρ_θ^{pol}]   a.s.,

where the rewards are collected by following policy θ, and for each a ∈ A, θ_a represents the vector of probabilities of taking action a in each state of the policy.

The form of the average reward in a linear PRP is not surprising, since the average reward takes the same form in the MDP/POMDP setting.
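Theorem 4 guarantees, in particular, that the long-run reward average converges for any such policy. The sketch below is our own Monte Carlo illustration, reusing the dictionary layout of the earlier PSR snippet, and is not the paper's method: it estimates the average reward of a linear PRP {m, {M_ao}, p_0, {r_a}} under a memoryless policy theta, and re-running it while sweeping theta traces out the rational function the theorem describes.

```python
import numpy as np

def estimate_average_reward(m, M, p0, r, theta, actions, observations,
                            steps=100_000, seed=0):
    """Estimate lim (1/k) sum_t R_t by simulating the linear PRP under theta."""
    rng = np.random.default_rng(seed)
    p, total = np.asarray(p0, dtype=float), 0.0
    for _ in range(steps):
        a = actions[rng.choice(len(actions), p=theta)]
        # One-step observation probabilities: P(o | p, a) = m^T M_{ao} p.
        probs = np.array([float(m @ (M[(a, o)] @ p)) for o in observations])
        o = observations[rng.choice(len(observations), p=probs / probs.sum())]
        p = M[(a, o)] @ p
        p = p / float(m @ p)                  # PSR state update
        total += float(r[a] @ p)              # expected reward r_{a_t}^T s(h^(t))
    return total / steps
```

Replacing the memoryless theta with a finite state controller only changes how the action is drawn; the state update and reward accumulation stay the same.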

However, this result is important not only due to the generalization of the average reward behavior from POMDPs to linear PRPs, but also because of its implications for the complexity of the average reward function. The following corollary connects the complexity of the average reward function to the dimensionality of the PSR representation and the number of hidden states in the corresponding POMDP representation. Since the average reward is a rational function of the policy parameters in both cases, we measure the complexity of this function in terms of the degree of the multivariate polynomials appearing in either its numerator or its denominator, whichever is larger.

Corollary 5. Consider a system that can be represented both by a POMDP with m hidden states and by an n-dimensional linear PRP. Let k be the size of the stochastic finite state controllers under consideration. The degree of the average reward function, when analyzed using the linear PRP representation, is O(kn), which can be (significantly) smaller than the degree O(km) obtained in the POMDP framework.

In the following subsections, we address two points that have not yet been discussed in enough detail. First, we discuss the representational power of a linear PRP compared to that of a POMDP. Second, we provide several synthetic examples of linear PRPs whose dimension is smaller than that of a corresponding POMDP. These examples highlight the benefits of the linear PRP framework.

4.1. Representation power of a linear PRP

As outlined above, existing planning algorithms in PSRs fall under the linear PRP representation assumptions. However, certain reward processes can be represented with a POMDP but not with a linear PRP. Although this might seem to be a disadvantage, we argue that such systems are ill-modeled from the perspective of reinforcement learning. The discrepancy arises from the fact that although the belief state of a POMDP is always sufficient to predict the expected reward, the predictive state for future action-observation sequences is not necessarily sufficient. For example, one can think of a POMDP with aliased hidden states that differ only in terms of their reward function. However, this setting undermines the classical approach of constructing policies solely based on actions and observations, since it is clear that knowing the reward signal allows making better predictions of future rewards in this case. If the model of the system is not known a priori, it is even less clear how to learn such a model. To the best of our knowledge, current state-of-the-art model learning techniques estimate the reward function after the action-observation process has been modeled. Even when the model is known, peculiar behavior might take place if the policy ignores the reward signal: e.g., the average reward can depend on the initial hidden state of the POMDP, despite the hidden dynamics being policy independent (Yu & Bertsekas, 2008). Hence, in reinforcement learning, it is reasonable to assume that the observations provide enough information about the reward, as is the case in the linear PRP setting.

4.2. Linear PRP examples

We now present a few synthetic examples of dynamical systems with control that can be represented exactly by a PRP significantly more compactly than the most compact representation of the same system in the POMDP framework. The examples are based on the probability clock example without control from Jaeger (2000), described in the OOM framework. Although these are synthetic examples, such clocks are in fact common in nature, e.g., in bistable biological systems (Chaves et al., 2008).
The shared property among our examples is the fact that at least one of the PSR matrices is a rotation matrix. The angle of the rotation will, in fact, determine the minimum number of states that a POMDP needs in order to represent the system. We first describe the probability clock example, since all the following systems are based on the same concept. Consider a system with two observations, O = {o_1, o_2}, and a 3-dimensional state. The initial state is the vector s_0 = (0.75, 0, 0.25), and the following two matrices represent the change of state corresponding to each of the observations:

M_{o_1} = 0.5 [ 1      0         0
                0   cos(α)  −sin(α)
                0   sin(α)   cos(α) ],

M_{o_2} = s_0 [ 0.5,  1 − 0.5(cos(α) + sin(α)),  1 − 0.5(cos(α) − sin(α)) ].

M_{o_1} performs the rotation of the state by an angle α around the first axis, and M_{o_2} resets the state back to s_0. The predictions of the system are given by

P(o^{(1)}, o^{(2)}, ..., o^{(k)}) = 1^T M_{o^{(k)}} ⋯ M_{o^{(1)}} s_0.

The name of this example is due to the interesting behavior of the one-step prediction of o_1: if we observe only sequences of o_1's, this prediction oscillates (the pattern is as in Fig. 1). The minimum number of states required for an HMM to represent this system exactly is the minimum k such that kα is a multiple of 2π (Jaeger, 2000). Hence, depending on α, one might require an arbitrarily large number of states in the corresponding HMM for an exact representation; an infinite number of states is needed if α, measured in degrees, is irrational.
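The following sketch builds the clock operators as reconstructed above (the entries of M_{o_2} are inferred from the requirement that, from any state, the probabilities of o_1 and o_2 sum to one) and prints the oscillating one-step prediction of o_1 along a run of consecutive o_1 observations.

```python
import numpy as np

def probability_clock(alpha):
    """Operators of the probability clock, as reconstructed in the text above."""
    s0 = np.array([0.75, 0.0, 0.25])
    c, s = np.cos(alpha), np.sin(alpha)
    M_o1 = 0.5 * np.array([[1.0, 0.0, 0.0],
                           [0.0,   c,  -s],
                           [0.0,   s,   c]])     # rotation about the first axis
    M_o2 = np.outer(s0, 1.0 - M_o1.sum(axis=0))  # rank-one reset back to s0
    return s0, M_o1, M_o2

s0, M_o1, M_o2 = probability_clock(np.deg2rad(45.0))
one = np.ones(3)
state = s0
for t in range(8):
    # One-step prediction P(o1 | t consecutive o1's observed so far).
    p_o1 = float(one @ (M_o1 @ state)) / float(one @ state)
    print(t, round(p_o1, 3))
    state = M_o1 @ state
```

With α = 45°, the printed predictions repeat with period 8, consistent with the statement that the minimum HMM size is the smallest k for which kα is a multiple of 2π.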

We now present several systems with control and reward functions that generalize the idea of the probability clock example to POMDPs. All the examples consider two-action (A = {a, b}), two-observation (O = {o_1, o_2}) systems, such that policies with small memory perform (at times significantly) better than constant policies.

The first, and perhaps the most interesting, scenario is a simple extension of the three-dimensional probability clock. Let

M_{a,o_1} ≡ M_{o_1},  M_{a,o_2} ≡ M_{o_2},  M_{b,o_1} ≡ M_{b,o_2} ≡ 0.5 I,

where I is the identity matrix of appropriate dimension. Hence, taking action a results in the same behavior as that of the probability clock example. Taking action b, though, results in an i.i.d. sequence of coin flips (observations o_1, o_2) but does not change the state. We equip this system with a reward signal and let its expectation be equal to P(o_1 | s, a), where s is the current state of the system. This specification satisfies the requirement of a linear PRP, since the expected reward is a linear function of the system's state. The optimal behavior is to take action a until the system reaches the state with the largest P(o_1 | s, a), and then choose action b thereafter.

The above example should be interpreted as a system that can operate under different modes. Action b runs the system in a given mode, while action a changes the mode of the system in a stochastic fashion. Clearly, one can generalize this example to a setting in which the system's operation itself requires several hidden states to represent, more actions are involved, etc. Operating the system in different modes does not affect the dynamics between these states, but potentially affects the reward obtained from each state-action pair. Following the same reasoning as above, it is clear that such a system might require significantly more hidden states than dimensions if represented in the POMDP framework. We note that the lac operon (see, e.g., Chaves et al. (2008)), as well as other genetic networks related to metabolizing different types of nutrients, exhibit this type of multiple-mode operation, with rewards dependent on the mode.
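A sketch of this mode-switching extension, built from the operators assumed in the probability clock snippet above (again an illustration, not the paper's code): action a applies the clock operators, while action b produces fair coin flips without moving the state.

```python
import numpy as np

alpha = np.deg2rad(30.0)                       # an arbitrary rotation angle
s0 = np.array([0.75, 0.0, 0.25])
c, s = np.cos(alpha), np.sin(alpha)
M_o1 = 0.5 * np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
M_o2 = np.outer(s0, 1.0 - M_o1.sum(axis=0))

m = np.ones(3)                                 # normalization vector of the PSR
M = {('a', 'o1'): M_o1, ('a', 'o2'): M_o2,     # action a: the probability clock
     ('b', 'o1'): 0.5 * np.eye(3),             # action b: i.i.d. coin flips,
     ('b', 'o2'): 0.5 * np.eye(3)}             #           state left unchanged

# Property 2 of a controlled linear PSR: for each action, the observation
# operators sum to a matrix whose transpose fixes m.
for a in ('a', 'b'):
    assert np.allclose((M[(a, 'o1')] + M[(a, 'o2')]).T @ m, m)

# Action b does not change the (normalized) prediction vector.
p_after_b = (M[('b', 'o1')] @ s0) / float(m @ (M[('b', 'o1')] @ s0))
print(np.allclose(p_after_b, s0))              # True
```

This is exactly the "mode" behavior described above: b leaves the state untouched, while repeated applications of the a-operators rotate it.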
The rest of the examples illustrate the flexibility of the probability clock with respect to the choice of parameters. The reward function is the same in all cases: 1 for observation o_1 and 0 for o_2; hence, the objective is to maximize the average probability of o_1 per step. As before, we let o_1 be the observation that rotates the state and o_2 be the reset. Figure 1 illustrates a system in which both actions behave as probability clocks starting from the same state, but having different rotation angles. The policy that chooses either action a or action b at all times obtains an average reward of 0.69, while the policy that takes action a three times (given that no reset occurred) and then b until the next reset obtains a higher average reward. Figure 2 illustrates an example in which the two actions have different resetting states, rotation angles and magnitudes of the cycles. A constant policy choosing one action at all times obtains a lower average reward than the policy that climbs the hill using action b and, once reset, chooses action a until the next reset. Such a policy requires only two internal states. The behavior of the average reward as a function of some of the two-state policy parameters is also presented in Figure 2.

Another example, of a system whose actions rotate the state in opposite directions and have slightly different resetting states, can be found in the supplementary material, depicted in Figure 3. A constant policy choosing one action at all times obtains an average reward of 0.43, while the policy that changes its action right after observing a reset achieves a higher average reward. As in the previous example, such a policy requires only two internal states, with the average reward behaving smoothly as shown in Figure 3.

All these examples are meant to give a sense of the type of systems that can benefit from the linear PRP representation. Although these systems require a much larger number of hidden states to be represented in the POMDP framework, they only require a 3-dimensional linear PRP representation. This guarantees that the average reward is a simple function of a policy with small memory, suggesting that appropriate policy search techniques can be very efficient.

5. Conclusion and future work

We proposed a formal definition for a reward process based on the linear PSR framework, and analyzed some of its properties. We proved that the average reward is a rational function of the policy parameters, whose complexity depends on the dimension of the underlying linear PSR. This suggests that systems represented by a small linear PRP should be amenable to effective policy search techniques. For example, one can use gradient-based policy search methods such as GPOMDP (Baxter & Bartlett, 2001) to efficiently find a policy with acceptable performance.

However, previous policy search methods do not seem to exploit the shape of the average reward function derived in this paper. As a result, this is a particularly interesting avenue for global policy search methods, since properties like smoothness of the function are typically easy to exploit with this type of search. Moreover, knowledge about the possible dimensionality of the system can boost the policy search even more. The development of suitable approaches and their analysis remains the main direction for future work. Another direction currently under investigation is the analysis of the behavior of the average reward function for policies that depend on the predictive sufficient statistic of the process, i.e., the underlying state of the linear PSR (Aberdeen et al., 2007). This analysis could further shed light on how the error in an approximated PSR state affects policy performance. Finally, the established result on the stationary distribution of a linear PSR can also be useful in the future development of stability guarantees for linear PSRs.

Figure 1. A system which rotates the state by a different angle for each action: a: 45°, b: 18°.

Figure 2. The first two plots describe the behavior of the system: 1) it rotates the state by a different angle for each action: a: 18°, b: 2°; 2) the resetting state for action a corresponds to a basin of the cycle, while the resetting state for action b corresponds to a point near the top of the cycle; 3) the magnitudes of the periods are different between the actions. The bottom plot demonstrates how the average reward changes as a function of α and β, where the policies having 2 hidden states (S ∈ {1, 2}) are parametrized as: P_π(a | S = 1) = P_π(b | S = 2) = 1, P_π(S' = 2 | S = 1, a) = P_π(S' = 1 | S = 2, a) = α, P_π(S' = 1 | S = 1, b) = P_π(S' = 2 | S = 2, b) = β.
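The two-internal-state controllers described in the Figure 2 caption can be written down directly; the sketch below (array layout and names are ours) returns the direct parametrization for given (α, β), which can then be swept over a grid and evaluated, for instance with a long-run simulation, to reproduce a smooth average-reward surface of the kind shown in the bottom plot.

```python
import numpy as np

def two_state_policy(alpha, beta):
    """Direct parametrization of the 2-state controllers from the Figure 2 caption."""
    # action_probs[s, a]: internal state 1 always plays a, state 2 always plays b.
    action_probs = np.array([[1.0, 0.0],
                             [0.0, 1.0]])
    # switch[s, a, s']: P(next internal state | current state, action taken),
    # independent of the observation, as in the caption.
    switch = np.array([[[1 - alpha, alpha],   # S=1, action a: switch w.p. alpha
                        [beta, 1 - beta]],    # S=1, action b: stay w.p. beta
                       [[alpha, 1 - alpha],   # S=2, action a: switch w.p. alpha
                        [1 - beta, beta]]])   # S=2, action b: stay w.p. beta
    return action_probs, switch
```

Since each internal state plays a single action deterministically, only the transition rows for the action actually played in that state matter.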

References

Aberdeen, D., Buffet, O., and Thomas, O. Policy-gradients for PSRs and POMDPs. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2007.
Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 2001.
Boots, B., Siddiqi, S. M., and Gordon, G. J. Closing the learning-planning loop with predictive state representations. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, 2010.
Chaves, M., Eissing, T., and Allgower, F. Bistable biological systems: A characterization through local compact input-to-state stability. IEEE Transactions on Automatic Control, 53(Special Issue):87-100, 2008.
Faigle, U. and Schonhuth, A. Asymptotic mean stationarity of sources with finite evolution dimension. IEEE Transactions on Information Theory, 53(7), 2007.
Gray, R. M. Probability, Random Processes, and Ergodic Properties. Springer, 2009.
Gray, R. M. and Kieffer, J. C. Asymptotically mean stationary measures. The Annals of Probability, 8(5), 1980.
Grinberg, Y. and Precup, D. On average reward policy evaluation in infinite state partially observable systems. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2012.
Izadi, M. T. and Precup, D. A planning algorithm for predictive state representations. In Proceedings of the 18th International Joint Conference on Artificial Intelligence. Morgan Kaufmann Publishers Inc., 2003.
Jaeger, H. Observable operator models for discrete stochastic time series. Neural Computation, 12(6), 2000.
James, M. R. Using predictions for planning and modeling in stochastic environments. PhD thesis, The University of Michigan, 2005.
James, M. R., Singh, S., and Littman, M. L. Planning with predictive state representations. In The 2004 International Conference on Machine Learning and Applications, 2004.
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99-134, 1998.
Littman, M. L., Sutton, R. S., and Singh, S. Predictive representations of state. Advances in Neural Information Processing Systems, 14, 2001.
Peters, J. and Schaal, S. Natural actor-critic. Neurocomputing, 71(7), 2008.
Rosencrantz, M., Gordon, G., and Thrun, S. Learning low dimensional predictive representations. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.
Schweitzer, P. J. Perturbation theory and finite Markov chains. Journal of Applied Probability, 1968.
Singh, S., James, M. R., and Rudary, M. R. Predictive state representations: A new theory for modeling dynamical systems. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2004.
Singh, S. P., Jaakkola, T., and Jordan, M. I. Learning without state-estimation in partially observable Markovian decision processes. In Proceedings of the Eleventh International Conference on Machine Learning, 1994.
Sutton, R. S. sutton/book/errata.html, 1998.
Wiewiora, E. W. Modeling probability distributions with predictive state representations. PhD thesis, University of California, San Diego, 2007.
Yu, H. and Bertsekas, D. P. On near optimality of the set of finite-state controllers for average cost POMDP. Mathematics of Operations Research, 33(1):1-11, 2008.


The convergence limit of the temporal difference learning The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector

More information

PolicyBoost: Functional Policy Gradient with Ranking-Based Reward Objective

PolicyBoost: Functional Policy Gradient with Ranking-Based Reward Objective AI and Robotics: Papers from the AAAI-14 Worshop PolicyBoost: Functional Policy Gradient with Raning-Based Reward Objective Yang Yu and Qing Da National Laboratory for Novel Software Technology Nanjing

More information

Lecture notes for Analysis of Algorithms : Markov decision processes

Lecture notes for Analysis of Algorithms : Markov decision processes Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

Lecture 3: The Reinforcement Learning Problem

Lecture 3: The Reinforcement Learning Problem Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Learning and Planning with Timing Information in Markov Decision Processes

Learning and Planning with Timing Information in Markov Decision Processes Learning and Planning with Timing Information in Markov Decision Processes Pierre-Luc Bacon, Borja Balle, and Doina Precup Reasoning and Learning Lab, School of Computer Science, McGill University {pbacon,bballe,dprecup}@cs.mcgill.ca

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Coarticulation in Markov Decision Processes

Coarticulation in Markov Decision Processes Coarticulation in Markov Decision Processes Khashayar Rohanimanesh Department of Computer Science University of Massachusetts Amherst, MA 01003 khash@cs.umass.edu Sridhar Mahadevan Department of Computer

More information

arxiv: v1 [cs.ai] 1 Jul 2015

arxiv: v1 [cs.ai] 1 Jul 2015 arxiv:507.00353v [cs.ai] Jul 205 Harm van Seijen harm.vanseijen@ualberta.ca A. Rupam Mahmood ashique@ualberta.ca Patrick M. Pilarski patrick.pilarski@ualberta.ca Richard S. Sutton sutton@cs.ualberta.ca

More information

Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning

Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning Active Policy Iteration: fficient xploration through Active Learning for Value Function Approximation in Reinforcement Learning Takayuki Akiyama, Hirotaka Hachiya, and Masashi Sugiyama Department of Computer

More information

I D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69

I D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69 R E S E A R C H R E P O R T Online Policy Adaptation for Ensemble Classifiers Christos Dimitrakakis a IDIAP RR 03-69 Samy Bengio b I D I A P December 2003 D a l l e M o l l e I n s t i t u t e for Perceptual

More information

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012 CSE 573: Artificial Intelligence Autumn 2012 Reasoning about Uncertainty & Hidden Markov Models Daniel Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer 1 Outline

More information

Temporal Difference Learning & Policy Iteration

Temporal Difference Learning & Policy Iteration Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.

More information

State Space Abstractions for Reinforcement Learning

State Space Abstractions for Reinforcement Learning State Space Abstractions for Reinforcement Learning Rowan McAllister and Thang Bui MLG RCC 6 November 24 / 24 Outline Introduction Markov Decision Process Reinforcement Learning State Abstraction 2 Abstraction

More information

Learning Multi-Step Predictive State Representations

Learning Multi-Step Predictive State Representations Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16) Learning Multi-Step Predictive State Representations Lucas Langer McGill University Canada Borja Balle

More information

Logic, Knowledge Representation and Bayesian Decision Theory

Logic, Knowledge Representation and Bayesian Decision Theory Logic, Knowledge Representation and Bayesian Decision Theory David Poole University of British Columbia Overview Knowledge representation, logic, decision theory. Belief networks Independent Choice Logic

More information

University of Alberta

University of Alberta University of Alberta NEW REPRESENTATIONS AND APPROXIMATIONS FOR SEQUENTIAL DECISION MAKING UNDER UNCERTAINTY by Tao Wang A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment

More information

In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G.

In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G. In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and J. Alspector, (Eds.). Morgan Kaufmann Publishers, San Fancisco, CA. 1994. Convergence of Indirect Adaptive Asynchronous

More information

Statistics 992 Continuous-time Markov Chains Spring 2004

Statistics 992 Continuous-time Markov Chains Spring 2004 Summary Continuous-time finite-state-space Markov chains are stochastic processes that are widely used to model the process of nucleotide substitution. This chapter aims to present much of the mathematics

More information

Efficient Sensitivity Analysis in Hidden Markov Models

Efficient Sensitivity Analysis in Hidden Markov Models Efficient Sensitivity Analysis in Hidden Markov Models Silja Renooij Department of Information and Computing Sciences, Utrecht University P.O. Box 80.089, 3508 TB Utrecht, The Netherlands silja@cs.uu.nl

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

Reduced-Rank Hidden Markov Models

Reduced-Rank Hidden Markov Models Reduced-Rank Hidden Markov Models Sajid M. Siddiqi Byron Boots Geoffrey J. Gordon Carnegie Mellon University ... x 1 x 2 x 3 x τ y 1 y 2 y 3 y τ Sequence of observations: Y =[y 1 y 2 y 3... y τ ] Assume

More information

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon Optimal Control of Partiality Observable Markov Processes over a Finite Horizon Report by Jalal Arabneydi 04/11/2012 Taken from Control of Partiality Observable Markov Processes over a finite Horizon by

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function Approximation Continuous state/action space, mean-square error, gradient temporal difference learning, least-square temporal difference, least squares policy iteration Vien

More information

Partially Observable Markov Decision Processes (POMDPs)

Partially Observable Markov Decision Processes (POMDPs) Partially Observable Markov Decision Processes (POMDPs) Geoff Hollinger Sequential Decision Making in Robotics Spring, 2011 *Some media from Reid Simmons, Trey Smith, Tony Cassandra, Michael Littman, and

More information

Markov Chains, Random Walks on Graphs, and the Laplacian

Markov Chains, Random Walks on Graphs, and the Laplacian Markov Chains, Random Walks on Graphs, and the Laplacian CMPSCI 791BB: Advanced ML Sridhar Mahadevan Random Walks! There is significant interest in the problem of random walks! Markov chain analysis! Computer

More information

Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks

Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks Kary Främling Helsinki University of Technology, PL 55, FI-25 TKK, Finland Kary.Framling@hut.fi Abstract. Reinforcement

More information

arxiv: v2 [math.oc] 13 Feb 2016

arxiv: v2 [math.oc] 13 Feb 2016 Geometry and Determinism of Optimal Stationary Control in Partially Observable Markov Decision Processes arxiv:537v [mathoc] 3 Feb Guido Montúfar Max Planck Institute for Mathematics in the Sciences 43

More information

Autonomous Helicopter Flight via Reinforcement Learning

Autonomous Helicopter Flight via Reinforcement Learning Autonomous Helicopter Flight via Reinforcement Learning Authors: Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, Shankar Sastry Presenters: Shiv Ballianda, Jerrolyn Hebert, Shuiwang Ji, Kenley Malveaux, Huy

More information

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational

More information

PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION PING HOU. A dissertation submitted to the Graduate School

PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION PING HOU. A dissertation submitted to the Graduate School PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION BY PING HOU A dissertation submitted to the Graduate School in partial fulfillment of the requirements for the degree Doctor of Philosophy Major Subject:

More information

CAP Plan, Activity, and Intent Recognition

CAP Plan, Activity, and Intent Recognition CAP6938-02 Plan, Activity, and Intent Recognition Lecture 10: Sequential Decision-Making Under Uncertainty (part 1) MDPs and POMDPs Instructor: Dr. Gita Sukthankar Email: gitars@eecs.ucf.edu SP2-1 Reminder

More information

A Control-Theoretic Perspective on the Design of Distributed Agreement Protocols, Part

A Control-Theoretic Perspective on the Design of Distributed Agreement Protocols, Part 9. A Control-Theoretic Perspective on the Design of Distributed Agreement Protocols, Part Sandip Roy Ali Saberi Kristin Herlugson Abstract This is the second of a two-part paper describing a control-theoretic

More information

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari

More information

The Markov Decision Process Extraction Network

The Markov Decision Process Extraction Network The Markov Decision Process Extraction Network Siegmund Duell 1,2, Alexander Hans 1,3, and Steffen Udluft 1 1- Siemens AG, Corporate Research and Technologies, Learning Systems, Otto-Hahn-Ring 6, D-81739

More information