Average Reward Optimization Objective In Partially Observable Domains
Yuri Grinberg, School of Computer Science, McGill University, Canada
Doina Precup, School of Computer Science, McGill University, Canada

Abstract

We consider the problem of average reward optimization in domains with partial observability, within the modeling framework of linear predictive state representations (PSRs) (Littman et al., 2001). The key to average-reward computation is to have a well-defined stationary behavior of a system, so the required averages can be computed. If, additionally, the stationary behavior varies smoothly with changes in policy parameters, average-reward control through policy search also becomes a possibility. In this paper, we show that PSRs have a well-behaved stationary distribution, which is a rational function of policy parameters. Based on this result, we define a related reward process particularly suitable for average reward optimization, and analyze its properties. We show that in such a predictive state reward process, the average reward is a rational function of the policy parameters, whose complexity depends on the dimension of the underlying linear PSR. This result suggests that average reward-based policy search methods can be effective when the dimension of the system is small, even when the system representation in the POMDP framework requires many hidden states. We provide illustrative examples of this type.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

1. Introduction

Partial observability is prevalent in practical domains, in which one has to model systems based on noisy observations. In domains with partial observability and finite action and observation spaces, a popular framework for modeling and control is that of finite state partially observable Markov decision processes (POMDPs) (Kaelbling et al., 1998).
POMDPs generalize the well-known framework of Markov decision processes (MDPs); hence, they inherit many useful properties from MDPs. In particular, the results of Schweitzer (1968) imply that the stationary distribution of an ergodic Markov chain is a rational function of the entries of the transition matrix. Under mild conditions, this result applies immediately to finite state POMDPs: the stationary distribution of the Markov chain induced by some finite memory policy exists and is a rational function of the policy parameters. This is important when one has to estimate expectations of different quantities (such as returns) with respect to the policy, since these expectations are taken with respect to the stationary behavior of the process. In particular, one can show that the average reward in finite state POMDPs is a linear function of the stationary distribution. Thus, knowing that the stationary distribution is robust to small changes in the policy implies that estimates of different quantities based on the stationary distribution are robust as well.

About a decade ago, a new modeling framework for finite action and observation spaces was introduced, named (linear) predictive state representations (PSRs) (Littman et al., 2001; Singh et al., 2004). PSRs alleviate the model learning challenges encountered in finite state POMDP models, because their state representation is grounded in measurable quantities. More importantly, PSRs are capable of representing some POMDPs with smaller dimension than the number of hidden states. Moreover, there are systems which can be represented by a linear PSR but not by a POMDP (Jaeger, 2000).
PSRs are formulated in terms of actions and observations, but rewards do not play as special a role in the original framework as they do in MDPs and POMDPs. The planning problem using linear PSRs has already been tackled by several authors, e.g., James et al. (2004); Boots et al. (2010); Izadi & Precup (2003), but little was said about how their specific assumptions on the reward function affect the combined reward-PSR process. The main motivation for our work is to provide a theoretical framework for reward processes based on PSRs, and to analyze the behavior of the average reward in this framework. There are two practical implications of this idea. First, it would enable the development of learning and planning algorithms which are directly based on evaluating reward averages, rather than learning a full predictive representation first and then using it to estimate policy values. In the MDP and POMDP frameworks, methods based on value functions and/or policy search are known to scale better to very large problems than methods based on full model estimation. We would like to bring this advantage to linear PSRs as well. Second, defining an appropriate reward process based on PSRs enables an easier theoretical analysis of any control method.

In Section 4, we present a (linear) predictive state reward process (PRP), which is built on top of a (linear) PSR and provides a suitable framework for average reward optimization. We analyze some of its properties and discuss its relationship to the POMDP framework. Our most important result connects the size of the linear PSR representation underlying the linear PRP to the complexity of the average reward as a function of policy parameters. We prove that a compact PSR representation induces a reward process with a simple average reward function.
We provide some examples in which the linear PRP representation is much smaller than the corresponding POMDP representation; hence, using the PRP for policy search, for example, could potentially be much easier and yield a solution faster. The key ingredient in analyzing the PRP is to analyze the behavior of its stationary distribution as a function of the policy used, similar to the case of a POMDP. This stationary distribution is in fact the stationary distribution of the underlying linear PSR, which has been shown to exist (Faigle & Schonhuth, 2007) given a fixed policy. However, its form has not been stated before. In Section 3, we show that the stationary distribution induced by a finite memory policy in a linear PSR is a rational function of the policy parameters, as long as the process is ergodic, similar to the finite state POMDP framework. This result is a novel contribution in itself, and can be useful if a stability/perturbation analysis of a linear PSR is desired. Throughout the paper, we omit the proofs for brevity; they are included in the appendix provided as supplementary material.

2. Background and notation

We consider systems with control having finite action and observation/percept spaces, denoted $A$ and $O$ respectively, and rewards coming from a bounded set $R \subset \mathbb{R}$. We define a reward process as a tuple $(\Omega, \mathcal{F}, T, \mu_0)$, where $(\Omega, \mathcal{F})$ is a measurable space, $\mu_0$ is a probability kernel representing the effect of actions, and $T$ is a shift operator. In our setting, the elements of $\Omega$ can be viewed as infinite sequences of action-observation-reward triples, $\omega \in \Omega : \omega = \langle a_1, o_1, r_1, a_2, o_2, r_2, \dots \rangle$, with $a_i \in A$, $o_i \in O$, $r_i \in R$. In this setting, $T$ is defined by $T(\langle a_1, o_1, r_1, a_2, o_2, r_2, \dots \rangle) = \langle a_2, o_2, r_2, a_3, o_3, r_3, \dots \rangle$, and $T^{-1}(\omega) = \{\omega' \mid T(\omega') = \omega\}$. At each time step, the reward process waits for the agent to take an action and outputs an observation-reward pair.
We are interested in the average reward setting, where the goal is to find a behavior (or policy) $\pi$ maximizing $\lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} R_t$, where $R_t$ is the random variable denoting the reward observed at time $t$ when following policy $\pi$. The average reward setting is in fact preferable to discounted rewards in partially observable systems when the dynamics are not known¹ (Singh et al., 1994). We consider the classical policy definition, in which the action choice depends (stochastically) only on actions and observations, but not rewards: $\forall k \in \mathbb{N}, \pi : (A \times O)^k \to [0, 1]^{|A|}$. Representing an arbitrary policy of this kind is infeasible, since it requires infinite memory. Therefore, only policies having a finite memory representation are considered. Such policies can be implemented through (stochastic) finite state controllers, whose parameters represent the transition probabilities between internal states of the policy, and the probabilities of actions conditioned on the policy state. We call this a direct policy parametrization. In particular, if the policy is memoryless and open-loop (i.e., it chooses actions with a fixed distribution regardless of the time step and history), its direct parametrization is simply the vector representing the distribution over actions.

¹ See also Sutton (1998), last bullet, for details; this addresses function approximation in MDPs, which is essentially equivalent to POMDPs.
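The direct parametrization of a stochastic finite state controller can be sketched as follows. This is a minimal illustration; the class and field names are our own, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

class FiniteStateController:
    """Stochastic finite-state controller (direct parametrization).

    action_probs[s, a]    : P(take action a | internal state s)
    transitions[s, o, s'] : P(next internal state s' | state s, observation o)
    (Hypothetical field names; the paper fixes only the abstract parametrization.)
    """
    def __init__(self, action_probs, transitions, start_state=0):
        self.action_probs = np.asarray(action_probs)
        self.transitions = np.asarray(transitions)
        self.state = start_state

    def act(self):
        # Sample an action conditioned on the current internal state.
        return rng.choice(self.action_probs.shape[1],
                          p=self.action_probs[self.state])

    def observe(self, obs):
        # Stochastically update the internal state given the observation.
        self.state = rng.choice(self.transitions.shape[2],
                                p=self.transitions[self.state, obs])

# A memoryless open-loop policy is the special case with one internal state:
# its parametrization is just the action distribution (here [0.3, 0.7]).
openloop = FiniteStateController(action_probs=[[0.3, 0.7]],
                                 transitions=[[[1.0], [1.0]]])
```

The open-loop controller above keeps its single internal state forever and draws actions i.i.d., matching the paper's description of a memoryless open-loop policy.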
Fixing a policy $\pi$ generates a stochastic process whose elements are action-observation-reward triples. We denote the distribution of this process by $P_\pi$ or $P_\theta$, where $\theta$ is the parametrization of $\pi$, and the expectation with respect to $P_\pi$ by $E_\pi$ or $E_\theta$. $P_\pi$ is asymptotically mean stationary (AMS) (Gray & Kieffer, 1980) if

$\forall F \in \mathcal{F} : \bar{P}_\pi(F) = \lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} P_\pi(T^{-i}F)$

exists; $\bar{P}_\pi$ is called the stationary mean; let $\bar{E}_\pi$ be the expectation with respect to $\bar{P}_\pi$. The AMS property is a necessary and sufficient condition for the running averages of any bounded quantity to converge; in particular, it guarantees the existence of the average reward. For example, if the reward process is a finite state Markov decision process (MDP), the stationary mean is the stationary distribution of the Markov chain induced by policy $\pi$. Moreover, $P_\pi$ is ergodic if $\forall G \in \mathcal{G} : \bar{P}_\pi(G) \in \{0, 1\}$, where $\mathcal{G} \subset \mathcal{F}$ is the set of invariant events ($T^{-1}G = G$). Thus, if $P_\pi$ is AMS and ergodic,

$\lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} R_t = \bar{E}_\pi(R)$ a.s. $(P_\pi, \bar{P}_\pi)$,

meaning that the average reward is constant almost surely (for more details see Gray (2009)). In a finite state MDP, the ergodicity of the process corresponds to the Markov chain induced by $\pi$ being irreducible.

2.1. Predictive State Representations

The idea of a predictive state of a system is rooted in the following observation: knowing the distribution of any finite sequence of observations given any sequence of actions describes completely and accurately the future dynamics of that system. In the setting we consider, the action and observation spaces are finite, meaning that one can represent the predictive state of the system with a countably infinite vector enumerating probabilities of finite sequences of observations given sequences of actions. We will refer to this vector as the state of the system. In general, it might not be possible to represent this vector concisely. In Littman et al.
(2001), a new representation for the action-observation controllable process (defined without rewards) was proposed, called predictive state representation (PSR). It captures processes in which predictions of a finite number of sequences are enough to compute the prediction of any other sequence. Formally, let $h, t \in (A \times O)^*$ be sequences of action-observation pairs of finite length, where $h$ is a history, i.e., a sequence observed until now, and $t$ is a sequence that might occur in the future, called a test. Let $Q = \{q_1, \dots, q_n\}$ be a set of tests, called core tests, and $p(h) \triangleq [P_{psr}(q_1|h), \dots, P_{psr}(q_n|h)]^\top$ be the prediction vector, where $P_{psr}(q|h) \triangleq \mu_0(hq)/\mu_0(h)$ represents the probability of observing the percepts from $q$ given that the agent has seen history $h$ and is going to take the actions from $q$. Then, $p(h)$ is a predictive state representation of dimension $n$ if and only if

$P_{psr}(t|h) = f_t(p(h)), \quad (1)$

for any test $t$ and history $h$, where $f_t : [0, 1]^n \to [0, 1]$ is any function independent of history. If $f_t$ is linear for all $t$, then the representation is called a linear PSR, and $P_{psr}(t|h) = m_t^\top p(h)$, where the function $f_t$ is replaced by a linear projection $m_t$. It has been shown (Singh et al., 2004; Wiewiora, 2007) that a linear PSR possesses a finite parametrization of the form $\{m \in \mathbb{R}^n, \{M_{ao} \in \mathbb{R}^{n \times n}\}_{a \in A, o \in O}, p_0 \in \mathbb{R}^n\}$, satisfying the following properties:

1. $m^\top p_0 = 1$,
2. $\forall a \in A : \left[\sum_{o \in O} M_{ao}\right]^\top m = m$,
3. $P_{psr}(a_1, o_1, \dots, a_k, o_k) = m^\top M_{a_k o_k} \cdots M_{a_1 o_1} p_0$,

where $p_0$ is the starting state of the linear PSR. Using property 3, one can verify that $p(hao) \triangleq \frac{M_{ao}\, p(h)}{m^\top M_{ao}\, p(h)}$ represents the PSR state after observing $ao$, given the previous PSR state $p(h)$. Predictions of future sequences can be calculated using the new PSR state and the rest of the linear PSR parameters only. For better readability, from now on we drop the prefix "linear" from "linear PSR" and explicitly state "non-linear" when referring to the general PSR framework.
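Properties 1-3 and the state update can be checked on a toy model. The following is a sketch, assuming a hypothetical 1-dimensional linear PSR in which observations are drawn i.i.d. with an action-dependent bias (so a single core test suffices and the state is always the scalar 1):

```python
import numpy as np

# A hypothetical 1-dimensional linear PSR: observations are i.i.d. given the
# action, so the prediction vector stays constant at p = [1].
m  = np.array([1.0])
p0 = np.array([1.0])
M = {('a', 'o1'): np.array([[0.7]]), ('a', 'o2'): np.array([[0.3]]),
     ('b', 'o1'): np.array([[0.4]]), ('b', 'o2'): np.array([[0.6]])}

def seq_prob(seq):
    """Property 3: P(a1,o1,...,ak,ok) = m^T M_{ak ok} ... M_{a1 o1} p0."""
    p = p0
    for a, o in seq:
        p = M[(a, o)] @ p
    return float(m @ p)

def update(p, a, o):
    """State update: p(hao) = M_ao p(h) / (m^T M_ao p(h))."""
    q = M[(a, o)] @ p
    return q / float(m @ q)
```

For instance, `seq_prob([('a','o1'), ('b','o2')])` multiplies the per-step operators, giving 0.7 × 0.6 = 0.42, and `update` always returns the constant state, as expected for an i.i.d. process.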
Any finite memory policy $\pi$ in a PSR induces an action-observation stochastic process, which can also be represented by a PSR, possibly of a larger dimension (Wiewiora, 2007). This process is often called a PSR without control. In PSRs without control, property 2 reduces to $\left[\sum_{ao \in A \times O} M_{ao}\right]^\top m = m$, and in property 3, $P_{psr}$ is replaced with $P_\pi$, where $\pi$ is the policy at hand. A PSR without control can be represented in at least two more frameworks: observable operator models (OOMs) (Jaeger, 2000) and transformed PSRs (TPSRs) (Boots et al., 2010; Rosencrantz
et al., 2004). These frameworks can be seen as different parametrizations of the same process, keeping the same dimension to represent its state (Singh et al., 2004; Boots et al., 2010; James, 2005). Therefore, one can use any of these frameworks equivalently to analyze properties of the system. In particular, our analysis uses results obtained under the OOM framework (Faigle & Schonhuth, 2007). Finally, the key property of these frameworks is that certain systems can be represented more compactly than in the classical finite state hidden Markov model (HMM) representation (Jaeger, 2000). Moreover, some of these systems might not be representable using finite-state HMMs at all. This is the reason why we focus on this type of representation.

3. The stationary distribution of a PSR

In this section, we investigate how the stationary distribution of a PSR behaves as the policy changes, where by stationary distribution we mean the stationary state of the system represented by a PSR. For this purpose, we focus on the stationary distribution of a PSR without control induced by some policy, and see how changes in this policy affect this distribution. Let $\rho_\pi$ be the stationary distribution of a PSR without control induced by policy $\pi$.² Let $p_t$, $t \in \mathbb{N}$, denote the expected state of the PSR at time $t$. Assuming that $p_0 = \rho_\pi$, we have, by definition: $\forall t \in \mathbb{N} : p_t = \rho_\pi$. Now, let

$M \triangleq \sum_{ao \in A \times O} M_{ao}$

be the PSR evolution matrix. It is called the evolution matrix due to the following property, which can be easily seen from property 3: $\forall t \in \mathbb{N} : p_t = M p_{t-1}$. The matrix $M$ will in fact define the stationary distribution of the PSR. Under the appropriate basis, it satisfies the conditions of Lemma 1 (Faigle & Schonhuth, 2007), and it can be seen as a generalized Markov transition matrix. More precisely, its spectral properties are the same as those of a usual Markov transition matrix, and the rows sum to 1, but some of the values might be negative (see Faigle & Schonhuth (2007), Section IV).
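In the raw PSR basis the evolution matrix acts on column states, $p_t = M p_{t-1}$, so the stationary state is a right eigenvector of $M$ for eigenvalue 1. The following sketch uses a hypothetical 2-dimensional PSR without control (with $m = \mathbf{1}$, so the two observation operators have columns jointly summing to 1); in general the expected state converges only in the Cesàro sense, but this example is aperiodic, so plain iteration already converges:

```python
import numpy as np

# Hypothetical 2-dimensional PSR without control (m = 1): the columns of
# M_o1 + M_o2 sum to 1, so M satisfies 1^T M = 1^T and evolves the expected
# state via p_t = M p_{t-1}.
M_o1 = np.array([[0.5, 0.1],
                 [0.1, 0.2]])
M_o2 = np.array([[0.2, 0.3],
                 [0.2, 0.4]])
M = M_o1 + M_o2                      # evolution matrix

# Stationary state: right eigenvector of M for eigenvalue 1, normalized so
# that its entries sum to 1.
w, v = np.linalg.eig(M)
rho = np.real(v[:, np.argmin(np.abs(w - 1))])
rho = rho / rho.sum()

# Iterating the evolution matrix from an arbitrary start state converges to
# rho here (the second eigenvalue of M is 0.3, well inside the unit circle).
p = np.array([1.0, 0.0])
for _ in range(100):
    p = M @ p
```

Solving $M\rho = \rho$ by hand for this $M$ gives $\rho = (4/7, 3/7)$, which the eigendecomposition and the iteration both recover.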
The following lemma shows that several important properties of Markov transition matrices generalize to evolution matrices as well (represented under the appropriate basis).

² Note that $\rho_\pi$ represents $\bar{P}_\pi$, the stationary mean of the action-observation stochastic process.

Lemma 1. Let $E \in \mathbb{R}^{n \times n}$ be such that

$E^* \triangleq \lim_{n\to\infty} \frac{1}{n}\sum_{t=0}^{n-1} E^t$

exists, and the rows of $E$ sum to 1. Also assume that $(E - I)$ has rank $n - 1$, where $I$ is an identity matrix of appropriate size. Then the following holds:

1. $EE^* = E^*E = E^*E^* = E^*$.
2. $(E - E^*)^n = E^n - E^*$.
3. $E^* = \mathbf{1}\rho^\top$, where $\mathbf{1}$ is a column vector of ones and $\rho$ is the unique column vector satisfying $\rho^\top E = \rho^\top$, $\rho^\top \mathbf{1} = 1$.
4. Let $Z \triangleq [I - (E - E^*)]^{-1}$; then $Z$ is well defined and

$Z = I + \lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n-1}\sum_{k=1}^{t}(E^k - E^*). \quad (2)$

5. $\rho^\top Z = \rho^\top$.

The next theorem extends Theorem 1 of Schweitzer (1968) to the case in which the two matrices at hand are evolution matrices.

Theorem 2. Let $E_1, E_2$ be as in Lemma 1, and let $E_1^*, E_2^*, \rho_1, \rho_2, Z_1, Z_2$ be the quantities defined as in Lemma 1 with respect to $E_1$ and $E_2$ respectively. Then

$H_{1\to 2} \triangleq [I - (E_2 - E_1)Z_1]^{-1} \quad (3)$

exists and is given by $H_{1\to 2} = Z_1^{-1}[I - E_1^* + E_2^*]Z_2$. Moreover, $\rho_1^\top H_{1\to 2} = \rho_2^\top$.

Theorem 2 lies at the heart of our work, since it establishes an exact relationship between the eigenvectors corresponding to eigenvalue 1 of two evolution matrices. Similarly to the theory of Markov chains, these eigenvectors represent the stationary distributions of the corresponding PSRs. Specifically, let $E_1$, $Z_1$ and $\rho_1$ be fixed, while $E_2$ varies. Theorem 2 shows that $\rho_2$ is a rational function of the entries of $E_2$. This is due to the fact that the matrix inverse can be calculated analytically through Cramer's rule. The next theorem is the main result of this section.

Theorem 3. Let an action-observation controllable process S be representable by a finite dimensional linear PSR, and let $\theta \in \Theta$ be a direct parametrization of a stochastic finite memory policy.
If the stochastic process generated from S by any policy $\theta \in \Theta$ is ergodic, then the stationary state of S is a rational function of $\theta$.
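Theorem 2 can be checked numerically in the special case where the evolution matrices are ordinary ergodic Markov transition matrices (rows summing to 1). The following is a sketch with two hypothetical 3-state chains; the matrix entries are our own illustration:

```python
import numpy as np

def stationary(E):
    """rho with rho^T E = rho^T and rho^T 1 = 1 (Lemma 1, property 3)."""
    w, v = np.linalg.eig(E.T)
    rho = np.real(v[:, np.argmin(np.abs(w - 1))])
    return rho / rho.sum()

def lemma1_quantities(E):
    """E* = 1 rho^T and Z = [I - (E - E*)]^{-1} for a row-stochastic E."""
    rho = stationary(E)
    Estar = np.outer(np.ones(len(E)), rho)
    Z = np.linalg.inv(np.eye(len(E)) - (E - Estar))
    return Estar, rho, Z

# Two hypothetical ergodic transition matrices (rows sum to 1).
E1 = np.array([[0.8, 0.2, 0.0],
               [0.1, 0.7, 0.2],
               [0.3, 0.2, 0.5]])
E2 = np.array([[0.6, 0.3, 0.1],
               [0.2, 0.5, 0.3],
               [0.4, 0.1, 0.5]])

E1s, rho1, Z1 = lemma1_quantities(E1)
E2s, rho2, Z2 = lemma1_quantities(E2)

# Theorem 2: H = [I - (E2 - E1) Z1]^{-1} = Z1^{-1} [I - E1* + E2*] Z2,
# and rho1^T H = rho2^T.
H = np.linalg.inv(np.eye(3) - (E2 - E1) @ Z1)
H_alt = np.linalg.inv(Z1) @ (np.eye(3) - E1s + E2s) @ Z2
```

Both forms of $H_{1\to 2}$ agree, and multiplying $\rho_1^\top$ by $H_{1\to 2}$ recovers $\rho_2^\top$, which is exactly the mechanism behind the rational-function claim of Theorem 3.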
The proof (included in the supplementary material) shows how to construct the parametrized evolution matrix in an appropriate basis, given a PSR without control parametrized by a policy. The construction guarantees that the entries of the evolution matrix are rational functions of the policy parameters. We show that this construction is well defined, and then invoke Theorem 2 to complete the proof. As mentioned, Theorem 3 requires the ergodicity condition for any policy under consideration. A violation of this condition implies the existence of a policy for which the initial state of the system affects its asymptotic behavior. In the theory of Markov chains, for example, this translates into the irreducibility condition on the chains induced by any policy from the set of policies considered. In particular, the irreducibility condition is either satisfied for all strictly stochastic policies or not satisfied for any policy. Hence, this assumption is considered standard in the policy search literature (Baxter & Bartlett, 2001; Peters & Schaal, 2008).

4. Predictive-state reward processes

In this section we define a (not necessarily linear) PSR-based reward process. The goal is to construct a reward process that inherits some of the useful properties of the PSR framework on one hand, while being general enough and suitable for average reward optimization on the other hand. In particular, we let the reward be continuous. Although we keep the definition quite general, we are interested in systems that are based on the linear PSR framework, and only this setting is treated afterwards. Let $h^{(t)}$ represent a sequence of action-observation-reward triples of length $t$, and $\bar{h}^{(t)}$ represent the corresponding sequence of action-observation pairs only.

Definition 1.
A predictive state reward process (PRP) is a reward process whose action-observation component can be represented by a (possibly non-linear) PSR with some state vector $s \in \mathbb{R}^n$, and whose reward $R_t$ at time $t$ satisfies, $\forall t \in \mathbb{N}$:

$E[R_t \mid h^{(t-1)}, a_t, o_t] = E[R \mid s(\bar{h}^{(t)}), a_t] \triangleq f_{R,a_t}[s(\bar{h}^{(t)})]$,

where $s(\bar{h}^{(t)})$ is the PSR state after observing the action-observation pairs $\bar{h}^{(t)}$, and $\{f_{R,a}\}_{a \in A}$ are functions independent of $\bar{h}^{(t)}$. A linear PRP is a PRP whose action-observation component can be represented by a linear PSR, and in which the $f_{R,a_t}$ are linear functions: $\forall a_t \in A : f_{R,a_t}[s(\bar{h}^{(t)})] = r_{a_t}^\top s(\bar{h}^{(t)})$, where $\forall a \in A : r_a \in \mathbb{R}^n$. The dimension of the PRP is defined to be the dimension of the underlying PSR.

Note that Definition 1 in fact formalizes the setting in which most of the planning problem in linear PSRs has been tackled. For example, in Boots et al. (2010) the reward function is obtained by applying linear regression from PSR states to the observed rewards. James et al. (2004) assume that the rewards are discrete and incorporate them directly into the observation vector. Izadi & Precup (2003) define the reward signal itself to be a linear function of the state of a linear PSR. Thus, current PSR-based planning algorithms either implicitly assume that the underlying reward process is a linear PRP, or make more restrictive explicit assumptions.

The rest of this section is devoted to the analysis of the behavior of the average reward as a function of the policy in systems represented by a linear PRP. Two questions arise in this context. First, it is not clear whether the average reward is well defined for any finite memory policy. It has been shown that systems represented by a linear PSR are AMS for any finite memory policy (Grinberg & Precup, 2012). This fact guarantees that averages of different observable quantities based on action-observation pairs are well defined, but not necessarily the average reward itself.
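Definition 1 can be instantiated minimally. The following sketch assumes a hypothetical 1-dimensional linear PRP (i.i.d. observations per action, rewards in {0, 1} with $E[R \mid s, a] = r_a^\top s$); since the PSR state is then the constant scalar 1, the average reward under a memoryless open-loop policy reduces to $\sum_a \theta_a r_a$, which the simulation recovers:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 1-dimensional linear PRP: under either action the observation
# is an i.i.d. draw, so the PSR state is the constant scalar s = 1, and the
# expected reward is E[R | s, a] = r_a * s.
r = {'a': 0.7, 'b': 0.4}          # r_a in R^1, written as scalars
theta = {'a': 0.5, 'b': 0.5}      # memoryless open-loop policy

# With s = 1 at all times, the average reward reduces to sum_a theta_a * r_a.
predicted = sum(theta[a] * r[a] for a in theta)

# Simulate: sample an action from theta, then a Bernoulli(r_a) reward.
n, total = 100_000, 0
for _ in range(n):
    a = 'a' if rng.random() < theta['a'] else 'b'
    total += rng.random() < r[a]
empirical = total / n
```

The running average of the sampled rewards converges to the predicted value, illustrating that for this toy PRP the average reward is indeed well defined and linear in the (trivial) stationary state.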
Second, when the average reward is well defined, we want to characterize its behavior as a function of the policy parameters. Both questions are addressed in Theorem 4. We show that under mild conditions the average reward exists and is linear in a quantity derived from the stationary distribution of the underlying PSR.

Theorem 4. Let an $n$-dimensional linear PRP be ergodic for a collection of stochastic finite state policies of size $m$ given by direct parametrization $\theta \in \Theta$. Let $\rho_\theta$ be the stationary state of the reward process without control induced by policy $\theta$. Denote by $\rho_\theta^{PRP} \in \mathbb{R}^n$ and $\rho_\theta^{pol} \in \mathbb{R}^m$ the stationary states of the PRP and the policy correspondingly, obtained from $\rho_\theta$. Also, let $\rho_\theta^{PRP}(a) \in \mathbb{R}^n$, for $a \in A$, be the PRP state obtained from starting the PRP at state $\rho_\theta^{PRP}$ and taking action $a$. Then, the average reward is a rational function of the policy parameters, given by:

$\lim_{k\to\infty} \frac{1}{k}\sum_{t=0}^{k-1} R_t = \sum_{a \in A} \left(r_a^\top \rho_\theta^{PRP}(a)\right)\left(\theta_a^\top \rho_\theta^{pol}\right)$ a.s.,

where the rewards are collected by following policy $\theta$, and $\forall a \in A : \theta_a$ represents the vector of probabilities of taking action $a$ in each state of the policy.

The form of the average reward in a linear PRP is not surprising, since the average reward takes the same form in the MDP/POMDP setting. However, this result is important not only due to the generalization
of the average reward behavior from POMDPs to linear PRPs, but because of its implications for the complexity of the average reward function. The following corollary connects the complexity of the average reward function to the dimensionality of the PSR representation and the number of hidden states in the corresponding POMDP representation. Since the average reward is a rational function of the policy parameters in both cases, we measure the complexity of this function in terms of the degree of the multivariate polynomials appearing in either the numerator or the denominator, whichever is larger.

Corollary 5. Consider a system that can be represented both by a POMDP with $m$ hidden states and by an $n$-dimensional linear PRP. Let $k$ be the size of the stochastic finite state controllers under consideration. The degree of the average reward function, when analyzed using the linear PRP representation, is $O(kn)$, which can be (significantly) smaller than the degree $O(km)$ obtained in the POMDP framework.

In the following subsections, we address two points that were not yet discussed in enough detail. First, we discuss the representational power of a linear PRP, compared to that of a POMDP. Second, we provide several synthetic examples of linear PRPs whose dimension is smaller than that of a corresponding POMDP. These examples highlight the benefits of the linear PRP framework.

4.1. Representation power of a linear PRP

As outlined above, existing planning algorithms for PSRs fall under the linear PRP representation assumptions. However, certain reward processes can be represented with a POMDP but not with a linear PRP. Although this might seem like a disadvantage, we argue that such systems are ill-modeled from the perspective of reinforcement learning. The discrepancy arises from the fact that although the belief state of a POMDP is always sufficient to predict the expected reward, the predictive state for the future action-observation sequences is not necessarily sufficient.
For example, one can think of a POMDP with aliased hidden states that differ only in terms of their reward function. However, this setting undermines the classical approach of constructing policies solely based on actions and observations, since it is clear that knowing the reward signal allows making better predictions of future rewards in this case. If the model of the system is not known a priori, it is even less clear how to learn such a model. To the best of our knowledge, current state-of-the-art model learning techniques estimate the reward function after the action-observation process has been modeled. Even when the model is known, peculiar behavior might take place if the policy ignores the reward signal; e.g., the average reward can depend on the initial hidden state of the POMDP, despite the hidden dynamics being policy independent (Yu & Bertsekas, 2008). Hence, in reinforcement learning, it is reasonable to assume that the observations provide enough information about the reward, as is the case in the linear PRP setting.

4.2. Linear PRP examples

We now present a few synthetic examples of dynamical systems with control that can be represented exactly by a PRP significantly more compactly than by the most compact representation of the system in the POMDP framework. The examples are based on the probability clock example without control from Jaeger (2000), described in the OOM framework. Although these are synthetic examples, such clocks are in fact common in nature, e.g., in bistable biological systems (Chaves et al., 2008). The shared property among our examples is that at least one of the PSR matrices is a rotation matrix. The angle of the rotation will, in fact, determine the minimum number of states that a POMDP needs to represent the system. We first describe the probability clock example, since all the following systems are based on the same concept. Consider a system with two observations, $O = \{o_1, o_2\}$, and a 3-dimensional state.
The initial state is the vector $s_0 = (0.75, 0, 0.25)^\top$, and the following two matrices represent the change of state corresponding to each of the observations:

$M_{o_1} = 0.5 \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\alpha) & -\sin(\alpha) \\ 0 & \sin(\alpha) & \cos(\alpha) \end{pmatrix}, \quad M_{o_2} = s_0 \left(0.5,\; 1 - 0.5[\cos(\alpha) + \sin(\alpha)],\; 1 - 0.5[\cos(\alpha) - \sin(\alpha)]\right).$

$M_{o_1}$ performs the rotation of the state by an angle $\alpha$ around the first axis, and $M_{o_2}$ resets the state back to $s_0$. The predictions of the system are given by:

$P(o^{(1)}, o^{(2)}, \dots, o^{(k)}) = \mathbf{1}^\top M_{o^{(k)}} \cdots M_{o^{(1)}} s_0.$

The name of this example is due to the interesting behavior of the one-step prediction of $o_1$: if we observe only sequences of $o_1$'s, this prediction oscillates (the pattern is as in Fig. 1). The minimum number of states required for an HMM to represent this system exactly is the minimum $k$ such that $k\alpha$ is a multiple of
$2\pi$ (Jaeger, 2000). Hence, depending on $\alpha$, one might require an arbitrarily large number of states in the corresponding HMM for an exact representation; an infinite number of states is needed if $\alpha$ is not a rational multiple of $2\pi$.

We now present several systems with control and reward functions that generalize the idea of the probability clock example to POMDPs. All the examples consider two-action ($A = \{a, b\}$), two-observation ($O = \{o_1, o_2\}$) systems, such that policies with small memory perform (at times significantly) better than constant policies. The first and perhaps most interesting scenario is a simple extension of the three-dimensional probability clock. Let $M_{a,o_1} \triangleq M_{o_1}$, $M_{a,o_2} \triangleq M_{o_2}$, and $M_{b,o_1} \triangleq M_{b,o_2} \triangleq 0.5\,I$, where $I$ is an identity matrix of appropriate dimension. Hence, taking action $a$ will result in the same behavior as that of the probability clock example. Taking action $b$, though, will result in an i.i.d. sequence of coin flips (observations $o_1$, $o_2$), but will not change the state. We equip this system with a reward signal and let its expectation be equal to $P(o_1 \mid s, a)$, where $s$ is the current state of the system. This specification satisfies the requirement of a linear PRP, since the expected reward is a linear function of the system's state. The optimal behavior is to take action $a$ until the system reaches the state with largest $P(o_1 \mid s, a)$, then choose action $b$ thereafter.

The above example should be interpreted as a system that can operate in different modes. Action $b$ runs the system in a given mode, while action $a$ changes the mode of the system in a stochastic fashion. Clearly, one can generalize this example to settings in which the system's operation itself requires several hidden states to represent, more actions are involved, etc. Operating the system in different modes does not affect the dynamics between these states, but potentially affects the reward obtained from each state-action pair.
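This first scenario can be simulated end to end. The following is a sketch under stated assumptions: the clock operators as written above, a hypothetical rotation angle $\alpha = 0.5$ rad, and a switching rule found by inspection for this angle (take $a$ until six consecutive $o_1$'s since the last reset, then $b$ forever); these constants are our own illustration, not from the paper:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)

alpha = 0.5                               # hypothetical rotation angle (rad)
s0 = np.array([0.75, 0.0, 0.25])
c, s = np.cos(alpha), np.sin(alpha)
Ma_o1 = 0.5 * np.array([[1.0, 0.0, 0.0],
                        [0.0, c,  -s ],
                        [0.0, s,   c ]])  # rotate around the first axis
Ma_o2 = np.outer(s0, [0.5, 1 - 0.5*(c + s), 1 - 0.5*(c - s)])  # reset to s0
one = np.ones(3)

def p_o1(p):
    """P(o1 | state p, action a) = 1^T M_{a,o1} p (here m = 1)."""
    return float(one @ (Ma_o1 @ p))

def seq_prob(obs_seq):
    """Probability of an observation sequence under repeated action a."""
    p = s0
    for o in obs_seq:
        p = (Ma_o1 if o == 1 else Ma_o2) @ p
    return float(one @ p)

# Sanity check: probabilities of all length-5 sequences must sum to 1.
total_len5 = sum(seq_prob(seq) for seq in product([1, 2], repeat=5))

def avg_reward(switch_after, horizon=50_000):
    """Average expected reward E[R | p, a] = P(o1 | p, a) for the policy:
    take a until `switch_after` consecutive o1's since the last reset, then
    take b (which freezes the state) forever. switch_after=None: always a;
    switch_after=0: always b."""
    p, run, total = s0.copy(), 0, 0.0
    frozen = (switch_after == 0)
    for _ in range(horizon):
        total += p_o1(p)
        if frozen:
            continue                      # action b: state unchanged
        if rng.random() < p_o1(p):        # action a, observe o1: rotation
            q, run = Ma_o1 @ p, run + 1
        else:                             # action a, observe o2: reset
            q, run = Ma_o2 @ p, 0
        p = q / float(one @ q)
        if switch_after is not None and run >= switch_after:
            frozen = True
    return total / horizon
```

The freezing policy needs only a small counter over recent observations, i.e., a small finite state controller, yet in this simulation it clearly beats both constant policies, matching the qualitative claim in the text.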
Following the same reasoning as above, it is clear that such a system might require significantly more hidden states than dimensions if represented in the POMDP framework. We note that the lac operon (see, e.g., Chaves et al. (2008)), as well as other genetic networks related to metabolizing different types of nutrients, exhibit this type of multiple-mode operation, with rewards dependent on the mode.

The rest of the examples illustrate the flexibility of the probability clock with respect to the choice of parameters. The reward function is the same for all cases: 1 for observation $o_1$ and 0 for $o_2$; hence, the objective is to maximize the average probability of $o_1$ per step. As before, we let $o_1$ be the observation that rotates the state and $o_2$ be the reset. Figure 1 illustrates a system in which both actions behave as probability clocks starting from the same state, but having different rotation angles. The policy that chooses either action $a$ or $b$ at all times obtains an average reward of 0.69, while the policy that takes action $a$ three times (given that no reset occurred) and then $b$ until the next reset obtains a higher average reward.

Figure 2 illustrates an example in which the two actions can have different resetting states, rotation angles and magnitudes of the cycles. A constant policy choosing one action at all times obtains a strictly lower average reward than the policy that climbs the hill using action $b$ and, once reset, chooses action $a$ until the next reset. Such a policy requires only two internal states. The behavior of the average reward as a function of some of the two-state policy parameters is also presented in Figure 2. Another example, of a system whose actions rotate the state in opposite directions and have slightly different resetting states, can be found in the supplementary material, depicted in Figure 3.
In that example, a constant policy choosing one action at all times obtains an average reward of 0.43, while the policy that changes its action right after observing a reset achieves a higher average reward. As in the previous example, such a policy requires only two internal states, with the average reward behaving smoothly as shown in Figure 3. All these examples are meant to give a sense of the type of systems that can benefit from the linear PRP representation. Although these systems require a much larger number of hidden states to be represented in the POMDP framework, they only require a 3-dimensional linear PRP representation. This guarantees that the average reward is a simple function of a policy with small memory, suggesting that appropriate policy search techniques can be very efficient.

5. Conclusion and future work

We proposed a formal definition of a reward process based on the linear PSR framework, and analyzed some of its properties. We proved that the average reward is a rational function of the policy parameters, and that its complexity depends on the dimension of the underlying linear PSR. This suggests that systems represented by a small linear PRP should be amenable to effective policy search techniques. For example, one can use gradient-based policy search methods such as GPOMDP (Baxter & Bartlett, 2001) to efficiently find
8 Figure 1. A system which rotates the state by a different angle for each action: a 45, b 18. a policy with an acceptable performance. However, previous policy search methods do not seem to exploit the shape of the average reward function derived in this paper. As a result, this is a particularly interesting avenue for global policy search methods since properties like smoothness of the function are typically easy to exploit using this type of search. Moreover, the knowledge about the possible dimensionality of the system can boost the policy search even more. The development of suitable approaches and their analysis remains the main direction for future work. Another direction currently under investigation is the analysis of the behavior of the average reward function for policies dependent on the predictive sufficient statistic of the process, i.e. the underlying state of the linear PSR (Aberdeen et al., 2007). This analysis could further shed light on how the error in an approximated PSR state affects the policy performance. Finally, the established result on the stationary distribution of a linear PSR can also be useful in the future development of stability guarantees for linear PSRs. Figure 2. The first two plots describe the behavior of the system: 1) it rotates the state by a different angle for each action: a 18, b 2 ; 2) the resetting state for action a corresponds to a basin of the cycle, while the resetting state for action b corresponds to a near top of the cycle; 3) the magnitudes of the periods are different between the actions. The bottom plot demonstrates how the average reward changes as a function of α and β, where the policies having 2 hidden states (S {1, 2}) are parametrized as: P π(a S = 1) = P π(b S = 2) = 1, P π(s = 2 S = 1, a) = P π(s = 1 S = 2, a) = α, P π(s = 1 S = 1, b) = P π(s = 2 S = 2, b) = β.
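To make the gradient-based option mentioned above concrete, the following is a minimal sketch of the GPOMDP estimator of Baxter & Bartlett (2001) on a toy two-state partially observable problem; the environment, observation-noise level, step counts, and learning rate are all illustrative assumptions, not taken from the paper. The eligibility trace z accumulates discounted score-function terms, and the running average delta estimates the average-reward gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gpomdp_gradient(theta, steps=2000, discount=0.9):
    """One GPOMDP run: returns an estimate of the average-reward gradient.
    Toy POMDP: hidden bit h flips w.p. 0.1 per step; the observation equals h
    w.p. 0.8; reward is 1 when the action matches h. theta[o] holds the softmax
    logits of the policy for observation o."""
    z = np.zeros_like(theta)      # eligibility trace of score functions
    delta = np.zeros_like(theta)  # running gradient estimate
    h = 0
    for t in range(steps):
        o = h if rng.random() < 0.8 else 1 - h      # noisy observation of h
        p = softmax(theta[o])
        a = rng.choice(2, p=p)
        r = 1.0 if a == h else 0.0
        score = np.zeros_like(theta)                # gradient of log pi(a | o)
        score[o] = -p
        score[o, a] += 1.0
        z = discount * z + score
        delta += (r * z - delta) / (t + 1)          # incremental average
        if rng.random() < 0.1:
            h = 1 - h
    return delta

theta = np.zeros((2, 2))          # start from the uniform policy
for _ in range(30):               # crude stochastic gradient ascent
    theta += 5.0 * gpomdp_gradient(theta)
```

After these updates the policy should prefer the action that matches the current observation; the discount factor is the bias-variance knob of the estimator. A method of this kind treats the average reward as a black box, which is exactly why the rational-function structure derived in the paper remains unexploited.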
References

Aberdeen, D., Buffet, O., and Thomas, O. Policy-gradients for PSRs and POMDPs. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2007.
Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 2001.
Boots, B., Siddiqi, S. M., and Gordon, G. J. Closing the learning-planning loop with predictive state representations. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, 2010.
Chaves, M., Eissing, T., and Allgöwer, F. Bistable biological systems: A characterization through local compact input-to-state stability. IEEE Transactions on Automatic Control, 53(Special Issue):87–100, 2008.
Faigle, U. and Schönhuth, A. Asymptotic mean stationarity of sources with finite evolution dimension. IEEE Transactions on Information Theory, 53(7), 2007.
Gray, R. M. Probability, Random Processes, and Ergodic Properties. Springer.
Gray, R. M. and Kieffer, J. C. Asymptotically mean stationary measures. The Annals of Probability, 8(5), 1980.
Grinberg, Y. and Precup, D. On average reward policy evaluation in infinite state partially observable systems. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2012.
Izadi, M. T. and Precup, D. A planning algorithm for predictive state representations. In Proceedings of the 18th International Joint Conference on Artificial Intelligence. Morgan Kaufmann Publishers Inc., 2003.
Jaeger, H. Observable operator models for discrete stochastic time series. Neural Computation, 12(6), 2000.
James, M. R. Using predictions for planning and modeling in stochastic environments. PhD thesis, The University of Michigan.
James, M. R., Singh, S., and Littman, M. L. Planning with predictive state representations. In The 2004 International Conference on Machine Learning and Applications, 2004.
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134, 1998.
Littman, M. L., Sutton, R. S., and Singh, S. Predictive representations of state. In Advances in Neural Information Processing Systems 14, 2001.
Peters, J. and Schaal, S. Natural actor-critic. Neurocomputing, 71(7), 2008.
Rosencrantz, M., Gordon, G., and Thrun, S. Learning low dimensional predictive representations. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 88, 2004.
Schweitzer, P. J. Perturbation theory and finite Markov chains. Journal of Applied Probability, 1968.
Singh, S., James, M. R., and Rudary, M. R. Predictive state representations: A new theory for modeling dynamical systems. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2004.
Singh, S. P., Jaakkola, T., and Jordan, M. I. Learning without state-estimation in partially observable Markovian decision processes. In Proceedings of the Eleventh International Conference on Machine Learning, 1994.
Sutton, R. sutton/book/errata.html.
Wiewiora, E. W. Modeling probability distributions with predictive state representations. PhD thesis, The University of California at San Diego.
Yu, H. and Bertsekas, D. P. On near optimality of the set of finite-state controllers for average cost POMDP. Mathematics of Operations Research, 33(1):1–11, 2008.