Estimating Passive Dynamics Distributions and State Costs in Linearly Solvable Markov Decision Processes during Z Learning Execution


SICE Journal of Control, Measurement, and System Integration, Vol. 7, No. 1, January 2014

Estimating Passive Dynamics Distributions and State Costs in Linearly Solvable Markov Decision Processes during Z Learning Execution

Mauricio BURDELIS and Kazushi IKEDA
Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), Nara, Japan
E-mail: mauricio-b@is.naist.jp, kazushi@is.naist.jp
(Received January 24, 2013; Revised May 10, 2013)

Abstract: Although the framework of linearly solvable Markov decision processes (LMDPs) reduces the computational complexity of reinforcement learning, it requires knowledge of the state-transition probability in the absence of control, called the passive dynamics. The passive dynamics can be estimated by a temporal difference method called Z learning if the environment obeys the passive dynamics. However, this leads to slow convergence, since no control is allowed during learning. This paper proposes a method to estimate the passive dynamics while executing Z learning under a state-transition probability different from the passive dynamics. The proposed method requires only knowledge of which states can be visited from each state, and estimates the state-transition probability as well as the immediate cost of the states from the constraints they must satisfy. Computer experiments showed that the proposed method successfully estimates the passive dynamics and state costs, remains more efficient than Q learning, and has a convergence speed comparable to that of traditional Z learning.

Key Words: reinforcement learning, Bellman equation, linearly solvable Markov decision process.

1. Introduction

Reinforcement learning (RL) is a machine learning technique used to learn how to take actions to achieve a desired goal. An agent is not given the correct action to take in each situation, but only a reward according to the current state and the chosen action, followed by a stochastic state transition. This is regarded as a discrete-time Markov decision process (MDP) with stochastic dynamics [1]. In other words, reinforcement learning can be applied widely to problems modeled by MDPs, such as trajectory optimization [2],[3], robotics and control [4], mobile communication [5], image recognition [6],[7], e-commerce [8] and medical treatment [9].

The objective of reinforcement learning is to choose actions that minimize the expected total cumulative cost, called the cost-to-go function. Since the values of the cost-to-go function at the current and the following states satisfy the Bellman equation, the function is given as the solution of the Bellman equation [1],[10]. Although solving the Bellman equation has exponential computational complexity in general, Todorov gave conditions on MDPs under which the Bellman equation becomes linear [11],[12]. In these conditions, the state-transition probability in the absence of control, called the passive dynamics, is crucial. Todorov also proposed a method to estimate the cost-to-go function called Z learning. Z learning is a temporal difference method that converges faster than traditional methods such as Q learning [1],[11],[12]. When no control is applied, the environment follows the passive dynamics and Z learning correctly estimates the cost-to-go function. Otherwise, Z learning requires knowledge of the passive dynamics, but it can balance the dilemma of exploration and exploitation.
For example, it can employ the greedy policy, which chooses the action that appears optimal given the currently estimated knowledge.

In this paper, we propose a method to estimate the passive dynamics and the immediate costs during the execution of Z learning with general dynamics. The proposed method estimates the state-transition probability using the immediate cost information observed by the agent while exploring the environment. Since the method updates the estimates step by step, it requires neither all immediate costs beforehand nor that the agent follow the passive dynamics. This property allows the agent to use a more efficient policy. The effectiveness of the method was confirmed by computer simulations, where Newtonian dynamics in a two-dimensional grid world were considered, including a simple model of inertia and collisions. The method showed performance comparable to greedy Z learning in the convergence speed of the estimates of the cost-to-go function, and better performance than Q learning.

This paper is organized as follows. Section 2 reviews the theory of traditional MDPs and explains the framework of LMDPs, as well as the sufficient conditions for its existence. Section 3 describes the proposed method for calculating the passive dynamics from observed costs and explains how to apply it during the execution of Z learning. Section 4 presents computational experiments and results. Section 5 discusses practical difficulties. Finally, Section 6 presents brief concluding remarks.

2. Linearly Solvable Markov Decision Processes

Suppose that the environment of a reinforcement learning problem is a discrete-time Markov decision process, that is,

    \Pr(x_{t+1} \mid u_t, x_t) = \Pr(x_{t+1} \mid u_t, x_t, x_{t-1}, x_{t-2}, \ldots)                    (1)

where x_t and u_t denote the state of the agent and the action taken at time step t, respectively. Then, the problem of reinforcement learning is to give the optimal probability p(u | x) of actions u ∈ U for the current state x ∈ X, where U and X denote the sets of possible actions and states, respectively.

If the agent takes an action u for a state x, the state changes to x' according to the state-transition probability p(x' | x, u) and the agent pays the immediate cost l(x, u). The optimality of actions here means minimizing the expected total cumulative cost v(x) from a state x until the agent reaches a terminal or goal state [1],[10],[11]. Hereafter, v(x) is termed the cost-to-go function, following Todorov's work. It is known that v(x) must satisfy

    v(x) = \min_u \left\{ l(x, u) + E_{p(\cdot \mid x, u)}[ v(x') ] \right\}                    (2)

which is called the Bellman equation. E_{p(\cdot \mid x, u)}[ v(\cdot) ] denotes the statistical expectation of v(\cdot) taken with respect to p(\cdot \mid x, u). Some methods such as dynamic programming [10] or reinforcement learning [1] can solve the Bellman equation. However, these can be time-consuming due to explosions of the number of unknown variables, because the number of future states grows exponentially with time.

Todorov showed that the class of linearly solvable Markov decision processes (LMDPs) greatly simplifies reinforcement learning [11],[12]. When specific conditions are met, the Bellman equation of an MDP becomes linear and the problem reduces to an eigenvector problem. We review these facts according to [11],[12]. There are two conditions for an MDP to have a linear Bellman equation. One is that the action u directly specifies the state-transition probability, that is, the state-transition probability p(x' | x, u) is represented as u(x' | x). The other is that the immediate cost is given by the sum of an action cost and a state cost, where the action cost is measured by the Kullback-Leibler (KL) divergence from the passive dynamics p_d(x' | x) to the current transition probability u(x' | x), that is,

    \mathrm{KL}\left( u(\cdot \mid x) \,\|\, p_d(\cdot \mid x) \right) \equiv E_{u(\cdot \mid x)}\left[ \log \frac{u(x' \mid x)}{p_d(x' \mid x)} \right]                    (3)

where E_{u(\cdot \mid x)} denotes the statistical expectation taken with respect to the controlled transition distribution u(\cdot \mid x). The passive dynamics is the transition probability that corresponds to the behavior of the system in the absence of controls. In theory it corresponds to a reference distribution that makes the KL divergence above null, and it can be arbitrary; it is usually defined as a random walk. The state cost depends only on the current state x and hence is denoted by q(x). In total, the immediate cost is expressed as

    l(x, u) = q(x) + \mathrm{KL}\left( u(\cdot \mid x) \,\|\, p_d(\cdot \mid x) \right).                    (4)

Todorov introduced the desirability function

    z(x) \equiv \exp( -v(x) )                    (5)

instead of considering v(x) itself. Then, under the conditions mentioned above, the Bellman equation (2) reduces to

    z(x) = \exp( -q(x) )\, G[z](x)                    (6)

where

    G[z](x) \equiv \sum_{x'} p_d(x' \mid x)\, z(x').                    (7)

Note that (6) is linear in z, and the optimal controlled transition probability u^* is given by

    u^*(x' \mid x) = \frac{ p_d(x' \mid x)\, z(x') }{ \sum_{x''} p_d(x'' \mid x)\, z(x'') }.                    (8)

The class of linearly solvable MDPs is restricted because of the conditions that must be satisfied, but it is important in reinforcement learning [11]-[15]. If the state costs q(x) and passive dynamics p_d are not known, they must be learned through the agent's exploration of the environment. One learning method is Z learning, a temporal difference method. It has the same benefits as other temporal difference methods such as Q learning [1], being an off-policy method and being able to absorb small errors in the measurement of the immediate cost, but it has the advantage of faster convergence.
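To make the linear formulation concrete, the following is a minimal Python sketch (an illustration, not the authors' code) that computes the desirability function by iterating (6) on a small LMDP and recovers the optimal control from (8). The toy transition matrix, the unit state costs, and the treatment of the goal state as absorbing are assumptions made for the example.

```python
import numpy as np

def solve_lmdp(P_passive, q, goal, n_iter=500):
    """Iterate z = exp(-q) * (P_passive @ z), as in (6), with the goal state treated as
    absorbing (z(goal) fixed to exp(-q(goal))). Returns z, the optimal control of (8),
    and the cost-to-go v = -log z from (5)."""
    z = np.ones(len(q))
    for _ in range(n_iter):
        z = np.exp(-q) * (P_passive @ z)
        z[goal] = np.exp(-q[goal])                 # absorbing goal: v(goal) = q(goal)
    u_star = P_passive * z[None, :]                # u*(x'|x) proportional to p_d(x'|x) z(x')
    u_star /= u_star.sum(axis=1, keepdims=True)
    return z, u_star, -np.log(z)

# toy 3-state chain whose passive dynamics drifts toward the goal state 2
P_passive = np.array([[0.50, 0.50, 0.00],
                      [0.25, 0.50, 0.25],
                      [0.00, 0.00, 1.00]])
q = np.array([1.0, 1.0, 0.0])                      # unit state cost except at the goal
z, u_star, v = solve_lmdp(P_passive, q, goal=2)
print(v)           # cost-to-go estimates
print(u_star[1])   # optimal controlled distribution from state 1
```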
When the agent follows the passive dynamics, Z learning updates the desirability function z as

    z_{new}(x_t) \leftarrow (1 - \eta_t)\, z_{cur}(x_t) + \eta_t \exp( -q_t )\, z_{cur}(x_{t+1})                    (9)

where z_{new}(x_t) is the new estimate of z at the current state x_t, z_{cur}(x_t) and z_{cur}(x_{t+1}) are the current estimates of z(x_t) and z(x_{t+1}), respectively, q_t is the state cost of the current state x_t, and \eta_t is a learning rate that decreases over time. The observed immediate cost l(x, p_d) is equal to the state cost q_t because the KL divergence (3) is null, so q_t can be obtained directly by observing the immediate cost l.

When the agent follows a controlled transition probability \hat{u}, we need to introduce the importance sampling technique into (9), that is,

    z_{new}(x_t) \leftarrow (1 - \eta_t)\, z_{cur}(x_t) + \eta_t \exp( -q_t )\, z_{cur}(x_{t+1})\, \frac{ p_d(x_{t+1} \mid x_t) }{ \hat{u}(x_{t+1} \mid x_t) }.                    (10)

This means that we can use a more efficient policy than the passive dynamics. In greedy Z learning, for example, \hat{u} is the policy that appears optimal according to (8) given the current estimate \hat{z}. However, this method requires knowledge of the passive dynamics p_d beforehand. Because the policy differs from the passive dynamics, the KL divergence is not null, and the observed costs l are not necessarily equal to the state costs q. Hence, the state costs must be known beforehand, or a method to measure q separately must be proposed.
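As a concrete illustration, here is a minimal sketch of the Z-learning update with importance sampling (10), assuming the agent can observe the next state, the state cost, and the probabilities assigned to the sampled transition by the passive dynamics and by the behavior policy; the function and variable names are not from the paper.

```python
import numpy as np

def z_update(z, x_t, x_next, q_t, p_d_prob, u_hat_prob, eta):
    """One Z-learning step while following a controlled policy u_hat, as in (10):
    the weight p_d(x_{t+1}|x_t) / u_hat(x_{t+1}|x_t) corrects for sampling the
    transition from u_hat instead of the passive dynamics."""
    target = np.exp(-q_t) * z[x_next] * (p_d_prob / u_hat_prob)
    z[x_t] = (1.0 - eta) * z[x_t] + eta * target
    return z

def greedy_policy(z, p_d_row):
    """Greedy controlled distribution over successors, as in (8), built from the
    current desirability estimates and the passive dynamics row p_d(.|x)."""
    w = p_d_row * z
    return w / w.sum()

# usage sketch: one update after a transition from state 2 to state 5
z = np.ones(10)                      # initial desirability estimates for 10 states
t, c = 1, 10_000                     # learning-rate schedule eta_t = c / (c + t)
z = z_update(z, x_t=2, x_next=5, q_t=1.0, p_d_prob=0.9, u_hat_prob=0.6,
             eta=c / (c + t))
```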

3. Passive Dynamics Estimation

To make Z learning applicable under a controlled condition, we need knowledge of the passive dynamics and the state costs. We propose a method for estimating them from measured immediate costs l(x, u) using the constraints they satisfy [16],[17]. This way, Z learning can be applied by measuring only immediate costs and updating estimates of the quantities of interest, as in other temporal difference methods such as Q learning.

Suppose that a discrete state space X has cardinality |X| = N_S (Fig. 1). Our method regards the log of each state-transition probability, log p_d(x' | x), as well as each state cost q(x), as unknown variables. Then, the variables and the immediate costs must satisfy (4), or more concretely

    l(x, u) = \sum_{x'} u(x' \mid x) \log \frac{ u(x' \mid x) }{ p_d(x' \mid x) } + q(x)
            = \sum_{x'} u(x' \mid x) \log u(x' \mid x) - \sum_{x'} u(x' \mid x) \log p_d(x' \mid x) + q(x)                    (11)

for any x. Rearranging the terms, we have

    -\sum_{x'} u(x' \mid x) \log p_d(x' \mid x) + q(x) = l(x, u) - \sum_{x'} u(x' \mid x) \log u(x' \mid x).                    (12)

Fig. 1  A state space with N_S possible states and N_S^2 possible transitions.

Because we know log u(x' | x) and can measure l(x, u), we get N_eq = N_S linear equations in log p_d(x' | x) and q(x) from N_S distinct states x with an arbitrarily fixed u (where N_S is the number of states and N_eq denotes the number of equations). Hence, repeating the procedure for N_S different controlled distributions u, say u_1, ..., u_{N_S}, we get a system of N_eq = N_S^2 linear equations. This number of equations is less than the number of unknown variables, N_S^2 + N_S. However, the probability p_d(\cdot | x) has an additional constraint,

    \sum_{x'} p_d(x' \mid x) = 1,                    (13)

for any x. Although these constraints are not linear in log p_d(x' | x), the resulting N_S^2 + N_S equations are easily solved by a gradient method (Fig. 2) or a variable substitution method (Fig. 3). For both algorithms, in the general case, applying fewer than N_S different controlled distributions u yields fewer equations than necessary, while applying more than N_S is unnecessary under the constraint (13).

Algorithm 1: Gradient Descent (with probability normalization)
 1: Take an initial solution m_1 using the Moore-Penrose pseudoinverse A_GD^+ of the matrix A_GD:  m_1 = A_GD^+ b
 2: repeat
 3:   Take a step of the gradient descent algorithm:  m_{t+1} = m_t + (1/\gamma) A_GD^T (b - A_GD m_t),  where \gamma > 1
 4:   Normalize the probabilities to sum up to one:  for i = 1 to N_V:  m_i \leftarrow \log\left( \exp(m_i) / \sum_{j=1}^{N_V} \exp(m_j) \right)
 5: until convergence
Fig. 2  A pseudo code of the proposed gradient descent with probability normalization algorithm.

Algorithm 2: Variable Substitution Method
 1: Observing that A_VS is a stochastic matrix, rewrite the system as  A_VS (q 1 - n) = b
 2: Change of variables:  c = q 1 - n
 3: Solve  A_VS c = b  for c
 4: Observing that \sum_{i=1}^{N_V} \exp(n_i) = 1, obtain  q = -\log\left( \sum_{i=1}^{N_V} \exp(-c_i) \right)
 5: Solve  q 1 - n = c  for n
Fig. 3  A pseudo code of the variable substitution algorithm.

In the gradient descent algorithm (Fig. 2), the system is written in vector notation as A_GD m = b, where A_GD is the matrix of coefficients, m is the vector of variables, and b is the vector of constants. Each equation of the system is of the form of (12). The elements m_1, ..., m_{N_V} of m are log p_d(x' | x) for all valid x' at x, and the last element m_{N_V + 1} is q(x). N_V is the number of valid possible future states at x (so the length of m is N_V + 1). The term 1/\gamma corresponds to the step size of the gradient descent algorithm. Each element of the constants vector b corresponds to the right-hand side of (12). The matrix A_GD has N_V rows and N_V + 1 columns, with element -u_i(x'_j | x) at row i and column j for columns 1 to N_V. The rightmost column (N_V + 1) consists of ones, the coefficients of the variable q(x).
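The following is a minimal Python sketch of the gradient descent with probability normalization of Fig. 2 for a single state x (an illustration, not the authors' code): each row of A_GD holds the coefficients -u_i(x'_j|x) plus a trailing one for q(x), and after every step the entries holding log p_d(.|x) are renormalized so that the probabilities sum to one. The synthetic data, the step size, and the iteration count are assumptions made for the example.

```python
import numpy as np

def estimate_pd_q_gd(U, b, gamma=10.0, n_iter=20000):
    """Estimate p_d(.|x) and q(x) for one state x, as in Fig. 2.
    U: (N_eq, N_V) matrix whose row i is the controlled distribution u_i(.|x).
    b: right-hand sides b_i = l(x, u_i) - sum_x' u_i(x'|x) log u_i(x'|x), as in (12)."""
    n_eq, n_v = U.shape
    A = np.hstack([-U, np.ones((n_eq, 1))])          # coefficients of [log p_d(.|x), q(x)]
    m = np.linalg.pinv(A) @ b                        # initial solution via the pseudoinverse
    for _ in range(n_iter):
        m = m + (1.0 / gamma) * A.T @ (b - A @ m)    # gradient step on ||A m - b||^2
        logp = m[:n_v]
        m[:n_v] = logp - np.log(np.exp(logp).sum())  # renormalize so p_d(.|x) sums to one
    return np.exp(m[:n_v]), m[n_v]                   # estimated p_d(.|x) and q(x)

# usage sketch: 3 valid successors, true p_d(.|x) = [0.2, 0.3, 0.5], true q(x) = 1.0
rng = np.random.default_rng(0)
p_d, q_true = np.array([0.2, 0.3, 0.5]), 1.0
U = rng.dirichlet(np.ones(3), size=3)                # three distinct controlled distributions
l = np.array([q_true + np.sum(u * np.log(u / p_d)) for u in U])   # measured costs from (4)
b = l - np.sum(U * np.log(U), axis=1)                # right-hand side of (12)
print(estimate_pd_q_gd(U, b))                        # should approach ([0.2, 0.3, 0.5], 1.0)
```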
In the variable substitution algorithm (Fig. 3), the system is written in vector notation as q 1 - A_VS n = b, where A_VS is the matrix of coefficients, n is the vector of variables (similar to m, but without q), q is the state cost of the current state x, and b is the vector of constants (the same as in the gradient descent algorithm). n_1, ..., n_{N_V} are the elements of n, and c_1, ..., c_{N_V} are the elements of c. The matrix A_VS is the negative of A_GD without the column of ones (the coefficients of the variable q(x)).

When q(x) is known, the problem becomes easier. If we consider minimizing the time to reach a goal, for example, q(x) takes a constant value. Then, (11) reduces to a system of linear equations,

    l(x, u) = \sum_{x'} u(x' \mid x) \log u(x' \mid x) - \sum_{x'} u(x' \mid x) \log p_d(x' \mid x) + c.                    (14)
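A minimal sketch of the variable substitution method of Fig. 3 follows (again an illustration, not the authors' code): the system A_VS c = b is solved for c, the state cost is recovered from the normalization constraint as q = -log sum_i exp(-c_i), and n = q1 - c gives log p_d(.|x). Using the pseudoinverse in the solve also covers incomplete systems with fewer than N_V equations, as described later in the text; the synthetic data are assumptions made for the example.

```python
import numpy as np

def estimate_pd_q_vs(U, b):
    """Variable substitution solve for one state x, as in Fig. 3.
    U: (N_eq, N_V) matrix of controlled distributions u_i(.|x) (this is A_VS).
    b: right-hand sides of (12). Returns the estimated p_d(.|x) and q(x)."""
    c = np.linalg.pinv(U) @ b           # solve A_VS c = b (pseudoinverse for incomplete systems)
    q = -np.log(np.exp(-c).sum())       # from sum_i exp(n_i) = 1 with n = q*1 - c
    n = q - c                           # n_i = log p_d(x'_i | x)
    return np.exp(n), q

# usage sketch with the same synthetic data as in the gradient descent example
rng = np.random.default_rng(0)
p_d, q_true = np.array([0.2, 0.3, 0.5]), 1.0
U = rng.dirichlet(np.ones(3), size=3)
b = q_true - U @ np.log(p_d)            # right-hand side of (12) for each u_i
p_hat, q_hat = estimate_pd_q_vs(U, b)
print(p_hat, q_hat)                     # recovers [0.2, 0.3, 0.5] and 1.0
```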

Our algorithm can be applied during Z learning with any policy, such as the greedy policy. In fact, we need not fix u in (14); on the contrary, a changing u is preferable for acquiring a well-conditioned equation system. Hence, (10) is replaced with

    z_{new}(x_t) \leftarrow (1 - \eta_t)\, z_{cur}(x_t) + \eta_t \exp( -\hat{q}(x_t) )\, z_{cur}(x_{t+1})\, \frac{ \hat{p}_d(x_{t+1} \mid x_t) }{ \hat{u}(x_{t+1} \mid x_t) },                    (15)

where \hat{p}_d(x_{t+1} | x_t) and \hat{q}(x_t) are the current estimates. The algorithm for estimating p_d and q during Z learning is illustrated in Fig. 4.

Algorithm 3: Calculating the passive dynamics distributions and state costs during Z learning
 1: when each state x_i is visited (following policy \hat{u}_t):
 2:   if p_d(\cdot | x_i) and q(x_i) are not yet known then
 3:     gather one equation from the measured l(x_i, \hat{u}_t):
          l(x_i, \hat{u}_t) = \sum_{x'} \hat{u}_t(x' \mid x_i) \log \hat{u}_t(x' \mid x_i) - \sum_{x'} \hat{u}_t(x' \mid x_i) \log p_d(x' \mid x_i) + q(x_i)
 4:     N_eq(x_i) \leftarrow N_eq(x_i) + 1
 5:     if N_eq(x_i) < N_V(x_i) then
 6:       get one solution of the incomplete system of N_eq(x_i) equations, under the constraint \sum_{x'} p_d(x' \mid x_i) = 1
 7:     else
 8:       solve the complete system (with N_eq(x_i) = N_V(x_i) equations), under the constraint \sum_{x'} p_d(x' \mid x_i) = 1
 9:       consider p_d(\cdot | x_i) and q(x_i) known
10:     end if
11:   end if
Fig. 4  A pseudo code of the proposed algorithm.

In the figure, \hat{u}_t is the policy that appears optimal (according to (8)) given the current estimates z_{cur}(x_i), \hat{p}_d(\cdot | x_i) and \hat{q}(x_i) at time t. N_V(x_i) is the number of valid possible future states at x_i. In order to get one solution to the incomplete system of equations (step 6 in Fig. 4), we can use either the gradient descent method (Fig. 2) or the variable substitution method (Fig. 3). The latter is preferable because it is not iterative and finds the solution directly, but for its application to an incomplete system we take the Moore-Penrose pseudoinverse of the matrix A_VS (step 3 in Fig. 3) and multiply it by b in order to obtain c. The gradient descent method (Fig. 2) can be used without any change.

4. Computational Experiments

To validate our method, three experiments were carried out. The first shows the efficiency of Z learning with the greedy policy (greedy Z learning) compared with Z learning following the passive dynamics (passive Z learning), under the condition that the passive dynamics is known. The second experiment confirms that our method can correctly estimate the passive dynamics and state costs under a controlled condition. The third experiment shows that our method works well during the execution of our modified Z learning.

The environment for all the experiments was a two-dimensional 10x10 grid world with obstacles (Fig. 5). The task of the agent was to reach the goal position (lower right in Fig. 5) from a random start position on the grid as fast as possible. To this end, the state cost of every state was set to unity, so that the total cost tends to be large if the trajectory chosen by the agent is long. Note that one possible application of LMDPs is finding shortest paths in graphs [11]. The agent obeyed Newtonian mechanics, that is, it could only move from its current position to an adjacent position at each time step. Each state of the agent consisted of the pair of the current and the previous positions, because Newtonian mechanics in discrete time is determined by this pair, and because the pair allows the Newtonian mechanics to be expressed as a Markov decision process (1). In total, there are N_P = 86 positions and N_S = 575 possible states. As for the walls and obstacles, we considered two cases: the obstacles were reflexive walls in one case and absorptive walls in the other. In both scenarios the passive dynamics is the same in open spaces (Fig. 6), where the agent obeys Newtonian mechanics with high probability hp = 0.9 and moves to another state with low probability. The passive dynamics of the absorptive scenario and the reflexive scenario differ when there are walls or obstacles (Fig. 7). In all experiments only transitions to adjacent positions were possible, and hence the number of valid future states N_V equals the number of positions adjacent to the current position.

Fig. 5  The environment of our experiments.
Fig. 6  Modeling inertia.
Fig. 7  Modeling collisions (hp = 0.9).
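As an illustration of this state representation (not the authors' code), the following sketch enumerates states as (previous position, current position) pairs of adjacent free cells of a small grid, which is one way of making the discrete-time inertial dynamics Markovian in the state; the grid contents and the four-neighbour adjacency rule are assumptions of the example, not the paper's 10x10 map.

```python
import numpy as np

# 0 = free cell, 1 = obstacle; a small illustrative grid, not the paper's environment
grid = np.array([[0, 0, 0, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0]])

free = [(r, c) for r in range(grid.shape[0])
               for c in range(grid.shape[1]) if grid[r, c] == 0]

def adjacent(a, b):
    """Four-neighbour adjacency between grid positions (an assumed movement rule)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1

# A state is the pair (previous position, current position), so that the
# discrete-time inertial dynamics satisfies the Markov property (1).
states = [(prev, cur) for prev in free for cur in free if adjacent(prev, cur)]
print(len(free), "positions,", len(states), "states")
```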
4.1 Efficiency of Greedy Z Learning

To confirm the efficiency of Z learning with dynamics other than the passive dynamics, we compared the learning curves of greedy Z learning and passive Z learning, assuming that the passive dynamics p_d and the state costs q were known. In our experiments, the learning rate \eta_t at time step t decays as \eta_t = c/(c + t), where c = 10,000 for greedy Z learning and c = 30,000 for passive Z learning. The estimation error was calculated as the normalized difference between the estimated cost-to-go function \hat{v} and the optimal one v obtained analytically using (6), that is,

    \frac{ \sum_{i=1}^{N_S} | \hat{v}(x_i) - v(x_i) | }{ \sum_{i=1}^{N_S} | v(x_i) | }.                    (16)

The learning curves in Figs. 8(a) and 8(b) show that Z learning converges faster when following the greedy policy than when following the passive dynamics. Hence, greedy Z learning is more efficient. Note that these results are consistent with the experiments in [11],[12].

Fig. 8  Learning curves for the reflexive and the absorptive environments.

4.2 Estimation of Passive Dynamics and State Costs

To confirm that the proposed method can correctly estimate the passive dynamics p_d and the state costs q from measured total immediate costs l, we ran Algorithms 1 and 2 with a controlled transition probability u. Here, u was set so that one adjacent state had a large probability, the other adjacent states had a low probability, and the remaining states had probability zero, as seen in Fig. 9, where the squares represent positions, the darker square is the current position, and the numbers represent transition probabilities (which depend on the number of adjacent states).

Fig. 9  A controlled transition probability u.

The estimation error was calculated as the difference between the estimated probabilities \hat{p}_d and the correct passive dynamics p_d, that is,

    | \hat{p}_d(x_i) - p_d(x_i) |.                    (17)

In a similar way, the errors of the state costs were calculated as

    | \hat{q}(x_i) - q(x_i) |.                    (18)

The results show that the errors are almost within the numerical precision of the simulator software for both algorithms (Fig. 10). This means that the method can calculate the passive dynamics p_d and the state costs q correctly.

Fig. 10  Box plots of the errors in the estimation of passive dynamics and state costs (estimation from action costs, from total costs with the gradient descent method, and from total costs with the variable substitution method).

4.3 Z Learning with Passive Dynamics and State Costs Estimation

To confirm that greedy Z learning performs well during estimation of the passive dynamics p_d and the state costs q, the authors ran Algorithm 1 with a controlled transition probability \hat{u} made from the current estimates \hat{p}_d and \hat{q}, and compared it with greedy Z learning with the optimal control u and with traditional Q learning (the \epsilon-greedy policy with \epsilon = 0.1). In the experiments, the learning rate \eta_t at time step t decays as \eta_t = c/(c + t), where c = 10,000 for greedy Z learning and c = 200,000 for Q learning. Each simulation was repeated ten times. The errors of the cost-to-go v were calculated as in (16). In the same way, the errors of the estimated passive dynamics and state costs were calculated as

    \frac{ \sum_{i=1}^{N_S} \sum_{j=1}^{N_S} | \hat{p}_d(x_j \mid x_i) - p_d(x_j \mid x_i) | }{ \sum_{i=1}^{N_S} \sum_{j=1}^{N_S} p_d(x_j \mid x_i) },                    (19)

    \frac{ \sum_{i=1}^{N_S} | \hat{q}(x_i) - q(x_i) | }{ \sum_{i=1}^{N_S} q(x_i) },                    (20)

respectively.

Fig. 11  Estimation errors of p_d and q, with reflexive walls and with absorptive walls.

Even during greedy Z learning with the estimates of p_d and q, the errors of the estimated passive dynamics and state costs consistently decreased as the number of simulation steps progressed (Fig. 11). Moreover, the errors of these quantities did not have a significant impact on the convergence speed of the Z learning algorithm (Fig. 12). The difference between the errors of Z learning with p_d and q estimation and those of traditional Z learning can be observed in Fig. 13. It can be observed in Fig. 11 that the errors of the estimated p_d and q become very small near a certain simulation time step. Before this time step, the errors of v for Z learning with p_d and q estimation are bigger (as seen in Fig. 12 and Fig. 13), because the correct p_d and q had not yet been estimated. Approximately after this time step, the errors of the estimated p_d and q are very small, and the difference of the errors of v oscillates (Fig. 13) and approaches zero in the long run. Positive values mean that the error of v for Z learning with p_d and q estimation is bigger. This was consistently observed for both simulated passive dynamics distributions.

5. Discussions

Although the proposed method was successful in the experiments, there is a difficulty in practical applications: the uniqueness of the solution of the linear equation systems. Each obtained system of linear equations should have a unique solution when the number of obtained equations equals the number of unknowns, that is, the number of valid future states N_V. However, if the controlled distributions are not sufficiently different from each other, the determinant of the matrix of equations can become close to zero and be considered null within the numerical precision of the simulator software. This makes the software unable to find a unique solution to the system, even when N_V equations are obtained. In our experiments a heuristic solution was adopted, by obtaining two more equations than necessary for each state. After this solution was adopted, the number of cases in which the system of equations was not solved for a unique solution was negligibly small.

Fig. 12  Errors of the estimated v by the following methods: Z learning with estimation of p_d and q, Z learning without p_d or q estimation, and the traditional Q learning algorithm.
Fig. 13  Difference of the error of the estimated v between Z learning with p_d and q estimation and traditional Z learning.

6. Conclusion

The authors proposed a method for the direct application of Z learning in a true temporal-difference approach, without the need for previous knowledge of the passive dynamics or the state costs. All that is required is the possibility of measuring the immediate costs (or total costs) incurred in state transitions, as well as knowledge of the impossible state transitions and of the controlled state-transition distributions imposed on the system. Complete knowledge of the controlled transition distributions and the possibility of imposing any desired distribution might not be available in realistic problems, in which symbolic actions might exist and the transition distributions might depend on those actions. A possible future work is to extend the method to consider symbolic actions.

Acknowledgments

This work was supported by JSPS KAKENHI grants.

References
[1] R.S. Sutton and A.G. Barto: Reinforcement Learning: An Introduction, MIT Press.
[2] M.A.P. Burdelis and K. Ikeda: Temporal difference approach in linearly solvable Markov decision problems, Proc. Artificial Life and Robotics, GS12-3.
[3] T. Kollar and N. Roy: Trajectory optimization using reinforcement learning for map exploration, The International Journal of Robotics Research, Vol. 27.
[4] J. Buchli, F. Stulp, E. Theodorou, and S. Schaal: Learning variable impedance control, The International Journal of Robotics Research, Vol. 30.
[5] J. Nie and S. Haykin: A dynamic channel assignment policy through Q-learning, IEEE Transactions on Neural Networks, Vol. 10, No. 6.
[6] K. Shibata and T. Kawano: Acquisition of flexible image recognition by coupling of reinforcement learning and a neural network, SICE Journal of Control, Measurement, and System Integration, Vol. 2, No. 2.
[7] A.A.M. Faudzi and K. Shibata: Acquisition of active perception and recognition through Actor-Q learning using a movable camera, Proc. SICE Annual Conf., FB03-2.
[8] L. Jian: An agent bilateral multi-issue alternate bidding negotiation protocol based on reinforcement learning and its application in e-commerce, Proc. Int'l Symposium on Electronic Commerce and Security.
[9] A. Gaweda, M. Muezzinoglu, G. Aronoff, A. Jacobs, J. Zurada, and M. Brier: Individualization of pharmacological anemia management using reinforcement learning, Neural Networks, Vol. 18.
[10] D. Bertsekas: Dynamic Programming and Optimal Control, Athena Scientific.
[11] E. Todorov: Efficient computation of optimal actions, Proceedings of the National Academy of Sciences, Vol. 106, No. 28.
[12] E. Todorov: Linearly-solvable Markov decision problems, Schölkopf et al. (eds.), Advances in Neural Information Processing Systems, Vol. 19, MIT Press.

[13] K. Doya: How can we learn efficiently to act optimally and flexibly?, Proceedings of the National Academy of Sciences, Vol. 106, No. 28.
[14] E. Todorov: Compositionality of optimal control laws, Advances in Neural Information Processing Systems, Vol. 22.
[15] K. Dvijotham and E. Todorov: Inverse optimal control with linearly-solvable MDPs, International Conference on Machine Learning, Vol. 27.
[16] M.A.P. Burdelis and K. Ikeda: Modeling and estimating passive dynamics distributions in linearly solvable Markov decision processes, IEICE Technical Report, NC.
[17] M.A.P. Burdelis and K. Ikeda: Estimating passive dynamics distributions in linearly solvable Markov decision processes from measured immediate costs in reinforcement learning problems, Proc. JNNS, P3-20.

Mauricio BURDELIS
He received the B.E. and M.E. degrees in computer systems engineering from the University of Sao Paulo in 2001 and 2009, respectively. Since 2009, he has been a Ph.D. student at the Nara Institute of Science and Technology (NAIST), receiving a scholarship from the Japanese Ministry of Education, Culture, Sports, Science, and Technology (MEXT).

Kazushi IKEDA (Member)
He received the B.E., M.E., and Ph.D. degrees in mathematical engineering and information physics from The University of Tokyo in 1989, 1991, and 1994, respectively. He was a research associate with the Department of Electrical and Computer Engineering, Kanazawa University beginning in 1994. In 1995, he was a research associate of the Chinese University of Hong Kong for three months. From 1998 to 2008, he was with the Graduate School of Informatics, Kyoto University, as an associate professor. Since 2008, he has been a full professor at the Nara Institute of Science and Technology. He is the editor-in-chief of the Journal of the Japanese Neural Network Society, an action editor of Neural Networks, an associate editor of IEEE Transactions on Neural Networks, and an associate editor of IEICE Transactions on Information and Systems.


More information

arxiv: v1 [cs.ai] 5 Nov 2017

arxiv: v1 [cs.ai] 5 Nov 2017 arxiv:1711.01569v1 [cs.ai] 5 Nov 2017 Markus Dumke Department of Statistics Ludwig-Maximilians-Universität München markus.dumke@campus.lmu.de Abstract Temporal-difference (TD) learning is an important

More information

On the Convergence of Optimistic Policy Iteration

On the Convergence of Optimistic Policy Iteration Journal of Machine Learning Research 3 (2002) 59 72 Submitted 10/01; Published 7/02 On the Convergence of Optimistic Policy Iteration John N. Tsitsiklis LIDS, Room 35-209 Massachusetts Institute of Technology

More information

Reinforcement Learning (1)

Reinforcement Learning (1) Reinforcement Learning 1 Reinforcement Learning (1) Machine Learning 64-360, Part II Norman Hendrich University of Hamburg, Dept. of Informatics Vogt-Kölln-Str. 30, D-22527 Hamburg hendrich@informatik.uni-hamburg.de

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

RL 3: Reinforcement Learning

RL 3: Reinforcement Learning RL 3: Reinforcement Learning Q-Learning Michael Herrmann University of Edinburgh, School of Informatics 20/01/2015 Last time: Multi-Armed Bandits (10 Points to remember) MAB applications do exist (e.g.

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

The Markov Decision Process Extraction Network

The Markov Decision Process Extraction Network The Markov Decision Process Extraction Network Siegmund Duell 1,2, Alexander Hans 1,3, and Steffen Udluft 1 1- Siemens AG, Corporate Research and Technologies, Learning Systems, Otto-Hahn-Ring 6, D-81739

More information

Fuzzy Model-Based Reinforcement Learning

Fuzzy Model-Based Reinforcement Learning ESIT 2, 14-15 September 2, Aachen, Germany 212 Fuzzy Model-Based Reinforcement Learning Martin Appl 1, Wilfried Brauer 2 1 Siemens AG, Corporate Technology Information and Communications D-8173 Munich,

More information

LINEARLY SOLVABLE OPTIMAL CONTROL

LINEARLY SOLVABLE OPTIMAL CONTROL CHAPTER 6 LINEARLY SOLVABLE OPTIMAL CONTROL K. Dvijotham 1 and E. Todorov 2 1 Computer Science & Engineering, University of Washington, Seattle 2 Computer Science & Engineering and Applied Mathematics,

More information

Reinforcement Learning in Non-Stationary Continuous Time and Space Scenarios

Reinforcement Learning in Non-Stationary Continuous Time and Space Scenarios Reinforcement Learning in Non-Stationary Continuous Time and Space Scenarios Eduardo W. Basso 1, Paulo M. Engel 1 1 Instituto de Informática Universidade Federal do Rio Grande do Sul (UFRGS) Caixa Postal

More information

In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G.

In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G. In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and J. Alspector, (Eds.). Morgan Kaufmann Publishers, San Fancisco, CA. 1994. Convergence of Indirect Adaptive Asynchronous

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information