
Simulation-Based Optimization of Markov Reward Processes: Implementation Issues [1]

Peter Marbach [2], John N. Tsitsiklis [3]

[1] This research was supported by contracts with Siemens AG, Munich, Germany, and Alcatel Bell, Belgium, and by contracts DMI-6254 and ACI-333 with the National Science Foundation.
[2] Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Current affiliation: Center for Communication Systems Research, Cambridge University, UK. p.marbach@ccsr.cam.ac.uk
[3] Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. jnt@mit.edu

Abstract

We consider discrete-time, finite-state-space Markov reward processes which depend on a set of parameters. Previously, we proposed a simulation-based methodology to tune the parameters to optimize the average reward. The resulting algorithms converge with probability 1, but may have a high variance. Here we propose two approaches to reduce the variance, which however introduce a new bias into the update direction. We report numerical results which indicate that the resulting algorithms are robust with respect to a small bias.

1 Introduction

In [MT] we considered finite-state Markov reward processes for which the transition probabilities and one-stage rewards depend on a parameter vector $\theta \in \mathbb{R}^K$, and proposed a simulation-based methodology which uses gradient estimates for tuning the parameter $\theta$ to optimize the average reward. The resulting algorithms can be implemented online and have the property that the gradient of the average reward converges to zero with probability 1 (which is the strongest possible result for gradient-related stochastic approximation algorithms). A drawback of these algorithms is that the updates may have a high variance, which can lead to slow convergence. This is due to the fact that they essentially employ a renewal period to produce an estimate of the gradient. If the length of a typical renewal period is large (as tends to be the case for systems involving a large state space), then the variance in the corresponding estimate can become quite high.

In this paper, we address this issue and propose two approaches to reduce the variance: one which estimates the gradient based on trajectories which tend to be shorter than a renewal period, and one which employs a discount factor. The resulting algorithms reduce the variance of a typical update, but introduce an additional bias into the update direction. As gradient-type methods tend to be robust with respect to small biases, the resulting algorithm may have an improved practical performance. A numerical study that we carried out suggests that this is indeed the case. For a comparison of our approach with other simulation-based optimization methods, such as the likelihood ratio method [Gly6, Gly], infinitesimal perturbation analysis (IPA) [CR4, CC, CW, FH4, FH], and neuro-dynamic programming/reinforcement learning [JSJ5], we refer to [MT].

2 Formulation

In this section we give a summary of the general framework, which is slightly more general than the one in [MT] (see [Mar] for a detailed discussion). Consider a discrete-time, finite-state Markov chain $\{i_n\}$ with state space $S = \{1, \ldots, N\}$, whose transition probabilities depend on a parameter vector $\theta \in \mathbb{R}^K$. We denote the one-step transition probabilities by $P_{ij}(\theta)$, $i, j \in S$, and the $n$-step transition probabilities by $P^n_{ij}(\theta)$. Whenever the state is equal to $i$, we receive a one-stage reward that also depends on $\theta$ and is denoted by $g_i(\theta)$.
For every $\theta \in \mathbb{R}^K$, let $P(\theta)$ be the stochastic matrix with entries $P_{ij}(\theta)$. Let $\mathcal{P} = \{P(\theta) \mid \theta \in \mathbb{R}^K\}$ be the set of all such matrices, and let $\bar{\mathcal{P}}$ be its closure. Note that every element of $\bar{\mathcal{P}}$ is also a stochastic matrix and, therefore, defines a Markov chain on the same state space. We make the following assumptions.

Assumption 1 (Recurrence) The Markov chain corresponding to every $P \in \bar{\mathcal{P}}$ is aperiodic. Furthermore, there exists a state $i^*$ which is recurrent for every such Markov chain.

Assumption 2 (Regularity) For all states $i, j \in S$, the transition probability $P_{ij}(\theta)$ and the one-stage reward $g_i(\theta)$ are bounded, twice differentiable, and have bounded first and second derivatives. Furthermore, we have
$$\nabla P_{ij}(\theta) = P_{ij}(\theta)\, L_{ij}(\theta),$$
where the function $L_{ij}(\theta)$ is bounded.
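To make Assumption 2 concrete, here is a minimal Python sketch, under a hypothetical row-wise softmax parameterization of a three-state chain, of the quantities it involves: the matrix $P(\theta)$ and the ratio $L_{ij}(\theta) = \nabla P_{ij}(\theta)/P_{ij}(\theta)$, with the gradient approximated by finite differences. None of these names or numbers come from the paper; they only illustrate a parameterization that satisfies the positivity and smoothness requirements.

    import numpy as np

    def transition_matrix(theta):
        # Hypothetical 3-state example: a row-wise softmax parameterization, so that
        # P_ij(theta) > 0, every row sums to 1, and P_ij is smooth in theta (Assumption 2).
        logits = np.array([[0.0,      theta[0], theta[1]],
                           [theta[1], 0.0,      theta[0]],
                           [theta[0], theta[1], 0.0]])
        expl = np.exp(logits - logits.max(axis=1, keepdims=True))
        return expl / expl.sum(axis=1, keepdims=True)

    def grad_P(theta, h=1e-6):
        # Central finite-difference approximation of grad P_ij(theta):
        # one (N x N) slice per component of theta.
        K = len(theta)
        grads = np.zeros((K,) + transition_matrix(theta).shape)
        for k in range(K):
            e = np.zeros(K)
            e[k] = h
            grads[k] = (transition_matrix(theta + e) - transition_matrix(theta - e)) / (2 * h)
        return grads

    def L_matrix(theta):
        # L_ij(theta) = grad P_ij(theta) / P_ij(theta), well defined here since P_ij(theta) > 0.
        return grad_P(theta) / transition_matrix(theta)

    theta = np.array([0.5, -0.3])
    print(L_matrix(theta)[0])           # L_ij(theta) with respect to theta[0]
    print(grad_P(theta).sum(axis=2))    # each row of grad P_ij(theta) sums to 0 (used in Section 5.1)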

Assumption 1 (Recurrence) states that under every transition matrix $P \in \bar{\mathcal{P}}$ we will eventually revisit the state $i^*$. This property allows us to employ results of renewal theory (see for example [Gal5]) in our analysis. Assumption 2 (Regularity) ensures that the transition probabilities $P_{ij}(\theta)$ and the one-stage reward $g_i(\theta)$ depend "smoothly" on $\theta$, and that the quotient $\nabla P_{ij}(\theta)/P_{ij}(\theta)$ is "well behaved".

As a performance metric we use the average reward criterion
$$\lambda(\theta) = \lim_{t \to \infty} \frac{1}{t}\, E_\theta\!\left[\sum_{k=0}^{t-1} g_{i_k}(\theta)\right].$$
Here, $i_k$ is the state visited at time $k$, and the notation $E_\theta[\cdot]$ indicates that the expectation is taken with respect to the distribution of the Markov chain with transition probabilities $P_{ij}(\theta)$. Under Assumption 1 (Recurrence), the average reward $\lambda(\theta)$ is well defined for every $\theta$ and does not depend on the initial state. We define the differential reward $v_i(\theta)$ of state $i \in S$ and the mean recurrence time $E_\theta[T]$ by
$$v_i(\theta) = E_\theta\!\left[\sum_{k=0}^{T-1} \bigl(g_{i_k}(\theta) - \lambda(\theta)\bigr) \,\Big|\, i_0 = i\right], \qquad E_\theta[T] = E_\theta[T \mid i_0 = i^*],$$
where $T = \min\{k > 0 \mid i_k = i^*\}$ is the first future time that state $i^*$ is visited. We have that $v_{i^*}(\theta) = 0$.

3 A Gradient-Based Algorithm for Updating $\theta$

Given that our goal is to maximize the average reward $\lambda(\theta)$, it is natural to consider gradient-type methods. Using the following expression for the gradient $\nabla\lambda(\theta)$ of the average reward with respect to $\theta$ (see [MT] for a derivation),
$$\nabla\lambda(\theta) = \sum_{i \in S} \pi_i(\theta) \Bigl(\nabla g_i(\theta) + \sum_{j \in S} \nabla P_{ij}(\theta)\, v_j(\theta)\Bigr), \qquad (1)$$
where $\pi_i(\theta)$ is the stationary probability of state $i$, we could consider an algorithm of the form
$$\theta_{k+1} = \theta_k + \gamma_k \nabla\lambda(\theta_k).$$
Alternatively, we could use simulation to produce an unbiased estimate $F_k(\theta_k)$ of $\nabla\lambda(\theta_k)$ and employ the approximate gradient iteration
$$\theta_{k+1} = \theta_k + \gamma_k F_k(\theta_k)$$
to tune $\theta$. Unbiased estimates of the gradient exist [Gly6], but it is not clear how to use them in an algorithm which has the properties we will require in the following. ([Gly6] develops an algorithm using unbiased estimates of the gradient which updates the parameter vector $\theta$ at visits to the recurrent state $i^*$; it is not known whether this algorithm can be extended so that the parameter gets updated at every time step, a property which we will require based on the discussion in Section 4.) This difficulty is bypassed by the method developed in the following. We will again employ an estimate of the gradient to update the parameter vector, which however is biased. Convergence of the resulting simulation-based method can be established by showing that the bias (asymptotically) vanishes.

3.1 Estimation of $\nabla\lambda(\theta)$

We rewrite the expression for the gradient $\nabla\lambda(\theta)$ as follows (see [MT]),
$$\nabla\lambda(\theta) = \sum_{i \in S} \pi_i(\theta) \Bigl(\nabla g_i(\theta) + \sum_{j \in S} P_{ij}(\theta)\, L_{ij}(\theta)\, v_j(\theta)\Bigr),$$
where $L_{ij}(\theta) = \nabla P_{ij}(\theta)/P_{ij}(\theta)$. This expression suggests the following way to estimate the gradient $\nabla\lambda(\theta)$. Let the parameter vector be fixed to some value $\theta$ and let $\{i_n\}$ be a sample path of the corresponding Markov chain, possibly obtained through simulation. Furthermore, let $t_m$ be the time of the $m$th visit to the recurrent state $i^*$. Consider the estimate of $\nabla\lambda(\theta)$ given by
$$F_m(\theta, \tilde\lambda) = \sum_{n=t_m}^{t_{m+1}-1} \Bigl(\tilde v_{i_n}(\theta, \tilde\lambda)\, L_{i_{n-1} i_n}(\theta) + \nabla g_{i_n}(\theta)\Bigr),$$
where, for $i_n = j$,
$$\tilde v_{i_n}(\theta, \tilde\lambda) = \sum_{k=n}^{t_{m+1}-1} \bigl(g_{i_k}(\theta) - \tilde\lambda\bigr) \qquad (2)$$
is an estimate of the differential reward $v_j(\theta)$, and $\tilde\lambda$ is some estimate of $\lambda(\theta)$. For $n = t_m$ we set $\tilde v_{i_n}(\theta, \tilde\lambda) = 0$. Assumption 1 (Recurrence) allows us to employ renewal theory (see for example [Gal5]) to obtain the following result, which states that the expectation of $F_m(\theta, \tilde\lambda)$ is aligned with $\nabla\lambda(\theta)$ to the extent that $\tilde\lambda$ is close to $\lambda(\theta)$ (see [MT, Mar]).

Proposition 1 We have
$$E_\theta\bigl[F_m(\theta, \tilde\lambda)\bigr] = E_\theta[T]\, \nabla\lambda(\theta) + G(\theta)\bigl(\lambda(\theta) - \tilde\lambda\bigr),$$
where
$$G(\theta) = E_\theta\!\left[\sum_{n=t_m+1}^{t_{m+1}} (t_{m+1} - n)\, L_{i_{n-1} i_n}(\theta)\right].$$
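As an illustration of how the estimate $F_m(\theta, \tilde\lambda)$ of (2) might be computed from one simulated regenerative cycle, here is a short Python sketch. The callables reward, grad_reward and L (returning $g_i(\theta)$, $\nabla g_i(\theta)$ and $L_{ij}(\theta) = \nabla P_{ij}(\theta)/P_{ij}(\theta)$, respectively) are hypothetical user-supplied helpers, and the summation conventions follow the reconstruction written above.

    import numpy as np

    def cycle_gradient_estimate(path, theta, lam_tilde, reward, grad_reward, L):
        # One-cycle gradient estimate F_m(theta, lam_tilde) built from (2).
        #   path        : states i_{t_m}, ..., i_{t_{m+1}} of one regenerative cycle
        #                 (it begins and ends at the recurrent state i*)
        #   lam_tilde   : current estimate of the average reward lambda(theta)
        #   reward      : reward(i, theta)      -> g_i(theta)
        #   grad_reward : grad_reward(i, theta) -> gradient of g_i(theta) w.r.t. theta
        #   L           : L(i, j, theta)        -> grad P_ij(theta) / P_ij(theta)
        T = len(path) - 1                       # cycle length t_{m+1} - t_m
        # v_tilde[n] = sum_{k=n}^{T-1} (g_{i_k}(theta) - lam_tilde), with v_tilde[0] = 0
        v_tilde = np.zeros(T)
        tail = 0.0
        for n in range(T - 1, 0, -1):
            tail += reward(path[n], theta) - lam_tilde
            v_tilde[n] = tail
        F = np.zeros_like(np.asarray(theta, dtype=float))
        for n in range(T):
            F += grad_reward(path[n], theta)
            if n > 0:                           # transition i_{n-1} -> i_n; v_tilde[0] = 0 anyway
                F += v_tilde[n] * L(path[n - 1], path[n], theta)
        return F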

3.2 An Algorithm that Updates at Visits to the Recurrent State

Using $F_m(\theta, \tilde\lambda)$ as an estimate of the gradient direction, we can formulate the following algorithm, which updates $\theta$ at visits to the recurrent state $i^*$. At the time $t_m$ that state $i^*$ is visited for the $m$th time, we have available a current vector $\theta_m$ and an average reward estimate $\tilde\lambda_m$. We then simulate the process according to the transition probabilities $P_{ij}(\theta_m)$ until the next time $t_{m+1}$ that $i^*$ is visited, and update according to
$$\theta_{m+1} = \theta_m + \gamma_m F_m(\theta_m, \tilde\lambda_m), \qquad (3)$$
$$\tilde\lambda_{m+1} = \tilde\lambda_m + \eta\, \gamma_m \sum_{n=t_m}^{t_{m+1}-1} \bigl(g_{i_n}(\theta_m) - \tilde\lambda_m\bigr), \qquad (4)$$
where $\eta$ is a positive scalar and $\gamma_m$ is a step size sequence which satisfies the following assumption.

Assumption 3 (Step Size) The step sizes $\gamma_m$ are nonnegative and satisfy
$$\sum_{m=1}^{\infty} \gamma_m = \infty, \qquad \sum_{m=1}^{\infty} \gamma_m^2 < \infty.$$

This assumption is satisfied, for example, if we let $\gamma_m = 1/m$. Note that the update in (3) is a biased estimate of the gradient direction (see Proposition 1). By showing that the update (4) drives this bias (asymptotically) to zero, we obtain the following convergence result (see [MT, Mar]).

Proposition 2 Let Assumption 1 (Recurrence), Assumption 2 (Regularity), and Assumption 3 (Step Size) hold, and let $\{\theta_m\}$ be the sequence of parameter vectors generated by the above described algorithm. Then, with probability 1, $\lambda(\theta_m)$ converges and
$$\lim_{m \to \infty} \nabla\lambda(\theta_m) = 0.$$

4 Implementation Issues

For systems involving a large state space, the interval between visits to the state $i^*$ can be large. This means that in the algorithm proposed above the parameter vector $\theta$ gets updated only infrequently and the estimate $F_m(\theta, \tilde\lambda)$ can have a large variance. In [MT], we have shown how the above algorithm can be extended so that the parameter vector gets updated at every time step. Here, we will in addition consider two approaches to reduce the variance in the updates, which are based on two alternative estimates of the differential reward.

4.1 Reducing the Variance

If the time until we reach the recurrent state $i^*$ from state $i$ is large, it may be desirable to choose a subset $S_0 \subset S$ containing $i^*$ and to estimate $v_i(\theta)$ through
$$\tilde v^{S_0}_i(\theta, \tilde\lambda) = \sum_{k=0}^{T_0 - 1} \bigl(g_{i_k}(\theta) - \tilde\lambda\bigr), \qquad (5)$$
where $T_0 = \min\{k > 0 \mid i_k \in S_0\}$ is the first future time that a state in the set $S_0$ is visited. Note that it takes fewer time steps to reach the set $S_0$ than to revisit the recurrent state $i^*$; therefore, $\tilde v^{S_0}_i(\theta, \tilde\lambda)$ typically has a smaller variance than the estimate based on (2). As an alternative approach, we may use a factor $\alpha$, $0 < \alpha < 1$, to discount future rewards. This leads to the following estimate,
$$\tilde v^{\alpha}_i(\theta, \tilde\lambda) = \sum_{k=0}^{T-1} \alpha^k \bigl(g_{i_k}(\theta) - \tilde\lambda\bigr), \qquad (6)$$
where $T = \min\{k > 0 \mid i_k = i^*\}$ is the first future time that the state $i^*$ is visited. In Sections 5 and 6 we will use (5) and (6) to derive modified estimates of the gradient $\nabla\lambda(\theta)$.

5 Using the Set $S_0$ to Reduce the Variance

Let the parameter $\theta \in \mathbb{R}^K$ be fixed to some value and let $(i_0, i_1, i_2, \ldots)$ be a trajectory of the corresponding Markov chain. Let $t_m$ be the time of the $m$th visit to the recurrent state $i^*$. Furthermore, given a set $S_0 \subset S$ containing $i^*$, let $\sigma(m)$ be the number of times between $t_m$ and $t_{m+1}$ that a state in the set $S_0 \setminus \{i^*\}$ is visited, let $t_{m,n}$ be the time of the $n$th such visit, and let $t_{m,0}$ and $t_{m,\sigma(m)+1}$ be equal to $t_m$ and $t_{m+1}$, respectively. Using these definitions, we consider the estimate $F^{S_0}_m(\theta, \tilde\lambda)$ of the gradient $\nabla\lambda(\theta)$ given by
$$F^{S_0}_m(\theta, \tilde\lambda) = \sum_{k=t_m}^{t_{m+1}-1} \Bigl(\tilde v^{S_0}_{i_k}(\theta, \tilde\lambda)\, L_{i_{k-1} i_k}(\theta) + \nabla g_{i_k}(\theta)\Bigr),$$
where, for $t_{m,n} \le k < t_{m,n+1}$ and $i_k \ne i^*$, we set
$$\tilde v^{S_0}_{i_k}(\theta, \tilde\lambda) = \sum_{l=k}^{t_{m,n+1}-1} \bigl(g_{i_l}(\theta) - \tilde\lambda\bigr),$$
and, for $k = t_m$, we set $\tilde v^{S_0}_{i_k}(\theta, \tilde\lambda) = 0$.
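For completeness, here is a minimal Python sketch of the cycle-based update (3)-(4) of Section 3.2, in the form written above; the reduced-variance estimators of Sections 5 and 6 would simply be passed in place of gradient_estimate. The callables simulate_cycle (returning one regenerative cycle $i_{t_m}, \ldots, i_{t_{m+1}}$ under $P(\theta)$), gradient_estimate (e.g. the one-cycle estimate sketched earlier, or a variant based on (5) or (6)) and reward (returning $g_i(\theta)$) are hypothetical user-supplied helpers.

    import numpy as np

    def optimize_at_renewals(theta0, lam0, eta, num_cycles,
                             simulate_cycle, gradient_estimate, reward, seed=0):
        # Update theta and lambda_tilde once per visit to the recurrent state i*,
        # following (3)-(4), with step sizes gamma_m = 1/m (which satisfy Assumption 3).
        rng = np.random.default_rng(seed)
        theta = np.asarray(theta0, dtype=float).copy()
        lam_tilde = float(lam0)
        for m in range(1, num_cycles + 1):
            gamma = 1.0 / m
            path = simulate_cycle(theta, rng)                 # states i_{t_m}, ..., i_{t_{m+1}}
            rewards = np.array([reward(i, theta) for i in path[:-1]])
            theta = theta + gamma * gradient_estimate(path, theta, lam_tilde)   # update (3)
            lam_tilde += eta * gamma * float(np.sum(rewards - lam_tilde))       # update (4)
        return theta, lam_tilde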
One can show (see [Mar]) that the expectation $E_\theta[F^{S_0}_m(\theta, \tilde\lambda)]$ is of the same form as the expectation of the original estimate $F_m(\theta, \tilde\lambda)$ in Proposition 1; however, the exact value of the differential reward $v_j(\theta)$ is replaced by the approximation
$$v^{S_0}_j(\theta) = E_\theta\!\left[\sum_{k=0}^{T_0-1} \bigl(g_{i_k}(\theta) - \lambda(\theta)\bigr) \,\Big|\, i_0 = j\right],$$
where $T_0 = \min\{k > 0 \mid i_k \in S_0\}$.

This introduces a new bias term $\Delta^{S_0}(\theta)$ into the estimate of the gradient direction, which is equal to
$$\Delta^{S_0}(\theta) = E_\theta[T] \sum_{i \in S} \pi_i(\theta) \sum_{j \in S} \nabla P_{ij}(\theta) \bigl(v^{S_0}_j(\theta) - v_j(\theta)\bigr).$$

5.1 A Bound on the Bias $\Delta^{S_0}(\theta)$

To derive an upper bound on $\|\Delta^{S_0}(\theta)\|$, we use some suitable assumptions on the bias in the estimate of the differential reward, given by
$$\hat v^{S_0}_i(\theta) = v^{S_0}_i(\theta) - v_i(\theta).$$
The basic idea is the following. As $\sum_{j \in S} \nabla P_{ij}(\theta) = 0$ for all $i \in S$ and all $\theta \in \mathbb{R}^K$, we only have to know the "relative magnitudes" of the differential rewards in order to compute $\nabla\lambda(\theta)$; i.e., we have, for every constant $c$,
$$\nabla\lambda(\theta) = \sum_{i \in S} \pi_i(\theta) \Bigl(\nabla g_i(\theta) + \sum_{j \in S} \nabla P_{ij}(\theta) \bigl(v_j(\theta) - c\bigr)\Bigr).$$
Therefore, if $\hat v^{S_0}_i(\theta) = \hat v^{S_0}_{i'}(\theta)$ for all $i, i' \in S$, then the bias term $\|\Delta^{S_0}(\theta)\|$ is equal to 0. In fact, the term $\|\Delta^{S_0}(\theta)\|$ vanishes under the weaker assumption that, for all states $i \in S$, we have $\hat v^{S_0}_j(\theta) = \hat v^{S_0}_{j'}(\theta)$ whenever $j, j' \in S_i$, where
$$S_i = \{\, j \in S \mid \nabla P_{ij}(\theta) \ne 0 \text{ for some } \theta \in \mathbb{R}^K \,\}.$$
Now assume that there exists an $\epsilon$ such that, for all states $i \in S$, we have
$$\bigl|\hat v^{S_0}_j(\theta) - \hat v^{S_0}_{j'}(\theta)\bigr| \le \epsilon \quad \text{if } j, j' \in S_i.$$
Then we have that, under Assumption 1 (Recurrence) and Assumption 2 (Regularity),
$$\|\Delta^{S_0}(\theta)\| \le \epsilon\, T_{\max}\, \bar N\, C,$$
where $T_{\max}$ is an upper bound on $E_\theta[T]$, $C$ is a bound on $\|\nabla P_{ij}(\theta)\|$, and $\bar N = \max\{N_i \mid i \in S\}$, where $N_i$ is the number of states in the set $S_i$. Therefore, in order to keep the bias $\Delta^{S_0}(\theta)$ small, one should choose $S_0$ such that, for all states $i \in S$ and for all states $j, j' \in S_i$, the difference $\hat v^{S_0}_j(\theta) - \hat v^{S_0}_{j'}(\theta)$ is small. A proof of this result is given in [Mar].

5.2 An Algorithm that Updates at Visits to the Recurrent State

Consider the following version of the algorithm in Section 3, which updates $\theta$ and $\tilde\lambda$ at visits to the recurrent state $i^*$:
$$\theta_{m+1} = \theta_m + \gamma_m F^{S_0}_m(\theta_m, \tilde\lambda_m), \qquad \tilde\lambda_{m+1} = \tilde\lambda_m + \eta\, \gamma_m \sum_{n=t_m}^{t_{m+1}-1} \bigl(g_{i_n}(\theta_m) - \tilde\lambda_m\bigr),$$
where we use the estimate $F^{S_0}_m(\theta_m, \tilde\lambda_m)$ in place of $F_m(\theta_m, \tilde\lambda_m)$. Under Assumption 1 (Recurrence) and Assumption 2 (Regularity), we have that, with probability 1,
$$\liminf_{m \to \infty} \|\nabla\lambda(\theta_m)\| \le \frac{D}{T_{\min}},$$
where $D$ is a bound on the bias $\|\Delta^{S_0}(\theta)\|$ and $T_{\min}$ is a lower bound on $E_\theta[T]$ (see [Mar]). This establishes that if the bias $\|\Delta^{S_0}(\theta)\|$ tends to be small, then the gradient $\nabla\lambda(\theta_m)$ is small at infinitely many visits to the recurrent state $i^*$. However, it does not say anything about the behavior of the sequence $\lambda(\theta_m)$, or about how we might detect instances at which the average reward $\lambda(\theta_m)$ is high. This result can be strengthened in the following sense (see [Mar]): if the upper bound $D$ on $\|\Delta^{S_0}(\theta)\|$ is small enough, then $\tilde\lambda_m$ overestimates the average reward $\lambda(\theta_m)$ at most by a little; i.e., there exists a bound $B(D)$ such that
$$\limsup_{m \to \infty} \bigl(\tilde\lambda_m - \lambda(\theta_m)\bigr) \le B(D).$$
This implies that $\tilde\lambda_m$ can be used to detect instances where the average reward $\lambda(\theta_m)$ is high. Similar to [MT], one can derive a version of this algorithm which updates $\theta$ at each time step (see [Mar]).

6 Using a Discount Factor to Reduce the Variance

Let the parameter $\theta \in \mathbb{R}^K$ be fixed to some value and let $(i_0, i_1, i_2, \ldots)$ be a sample path of the corresponding Markov chain. Furthermore, given a discount factor $\alpha \in (0, 1)$, let $t_m$ be the time of the $m$th visit to the recurrent state $i^*$ and consider the following estimate $F^{\alpha}_m(\theta, \tilde\lambda)$ of the gradient $\nabla\lambda(\theta)$,
$$F^{\alpha}_m(\theta, \tilde\lambda) = \sum_{n=t_m}^{t_{m+1}-1} \Bigl(\tilde v^{\alpha}_{i_n}(\theta, \tilde\lambda)\, L_{i_{n-1} i_n}(\theta) + \nabla g_{i_n}(\theta)\Bigr),$$
where, for $t_m < n \le t_{m+1} - 1$, we set
$$\tilde v^{\alpha}_{i_n}(\theta, \tilde\lambda) = \sum_{k=n}^{t_{m+1}-1} \alpha^{k-n} \bigl(g_{i_k}(\theta) - \tilde\lambda\bigr),$$
and, for $n = t_m$, we set $\tilde v^{\alpha}_{i_n}(\theta, \tilde\lambda) = 0$. As in the previous section, the expectation $E_\theta[F^{\alpha}_m(\theta, \tilde\lambda)]$ is of the same form as the expectation of the original estimate $F_m(\theta, \tilde\lambda)$, except that the exact value of the differential reward $v_j(\theta)$ is replaced by the approximation
$$v^{\alpha}_j(\theta) = E_\theta\!\left[\sum_{k=0}^{T-1} \alpha^k \bigl(g_{i_k}(\theta) - \lambda(\theta)\bigr) \,\Big|\, i_0 = j\right],$$
where $T = \min\{k > 0 \mid i_k = i^*\}$.
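The discounted estimate of Section 6 differs from the earlier one-cycle sketch only in the $\alpha^{k-n}$ weighting of the inner sum, which can be accumulated by a backward recursion; a minimal Python sketch follows, again with hypothetical user-supplied callables reward, grad_reward and L as before.

    import numpy as np

    def discounted_cycle_gradient_estimate(path, theta, lam_tilde, alpha,
                                           reward, grad_reward, L):
        # Discounted one-cycle estimate F^alpha_m(theta, lam_tilde):
        # v_alpha[n] = sum_{k=n}^{T-1} alpha**(k-n) * (g_{i_k}(theta) - lam_tilde), v_alpha[0] = 0.
        T = len(path) - 1                       # cycle length t_{m+1} - t_m
        v_alpha = np.zeros(T)
        tail = 0.0
        for n in range(T - 1, 0, -1):           # backward: tail_n = (g_{i_n} - lam) + alpha * tail_{n+1}
            tail = (reward(path[n], theta) - lam_tilde) + alpha * tail
            v_alpha[n] = tail
        F = np.zeros_like(np.asarray(theta, dtype=float))
        for n in range(T):
            F += grad_reward(path[n], theta)
            if n > 0:
                F += v_alpha[n] * L(path[n - 1], path[n], theta)
        return F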
6.1 A Bound on the Bias $\Delta^{\alpha}(\theta)$

To derive a bound on the resulting new bias $\Delta^{\alpha}(\theta)$, which is equal to
$$\Delta^{\alpha}(\theta) = E_\theta[T] \sum_{i \in S} \pi_i(\theta) \sum_{j \in S} \nabla P_{ij}(\theta) \bigl(v^{\alpha}_j(\theta) - v_j(\theta)\bigr),$$
we use the following assumption.

There exist scalars $\beta$, $0 < \beta < 1$, and $A$ such that, for all states $i, j \in S$ and all integers $n$, we have
$$\bigl|P^n_{ij}(\theta) - \pi_j(\theta)\bigr| \le A\, \beta^n.$$
We can think of $\beta$ as a "mixing constant" for the Markov reward processes associated with $\bar{\mathcal{P}}$. It can be shown that under Assumption 1 (Recurrence) such constants $A$ and $\beta$ exist (see [Mar]). It then follows that, under Assumption 1 (Recurrence) and Assumption 2 (Regularity), $\|\Delta^{\alpha}(\theta)\|$ satisfies a bound of the form
$$\|\Delta^{\alpha}(\theta)\| \le A\, T_{\max}\, C\, \bar N\, c(\alpha, \beta),$$
where $T_{\max}$ is an upper bound on $E_\theta[T]$, $C$ is a bound on $\|\nabla P_{ij}(\theta)\|$, $\bar N$ is the same as in Section 5, and $c(\alpha, \beta)$ is a factor depending only on $\alpha$ and $\beta$. This result states that the bound on the bias $\Delta^{\alpha}(\theta)$ vanishes as $\beta \downarrow 0$ and as $\alpha \uparrow 1$. A proof is given in [Mar]. Again, we can use the estimate $F^{\alpha}_m(\theta, \tilde\lambda)$ to derive an algorithm which updates the parameter vector $\theta$ either at visits to the recurrent state $i^*$ or at every time step (see [Mar]). The results of the previous section also hold in this case.

Figure 1: Trajectory of the exact average reward (top) and threshold parameters (bottom) of the idealized gradient algorithm.

7 Experimental Results

We applied the algorithms proposed above to a problem where the provider of a communication link has to accept and reject incoming calls of several types, while taking into account the current congestion. Calls are charged according to the type they belong to, and the goal of the provider is to maximize the long-term average reward. We consider the following admission control policy, described in terms of a "fuzzy threshold" parameter $\theta(m)$ for each call type $m = 1, \ldots, M$. Given that the currently used link bandwidth is equal to $B$, a new call of type $m$ is accepted with probability
$$\frac{1}{1 + \exp\bigl(B - \theta(m)\bigr)}$$
and is otherwise rejected. A detailed description of the case study can be found in [Mar]. In the following, we provide a short summary of the main results for a case study involving 3 different service types.
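As a small illustration of the "fuzzy threshold" policy just described, the acceptance probability $1/(1 + \exp(B - \theta(m)))$ can be evaluated as in the Python sketch below; the threshold and bandwidth values are made-up examples and are not those of the case study in [Mar].

    import numpy as np

    def accept_probability(B, theta_m):
        # Probability of admitting a new call of type m when the currently used
        # link bandwidth is B and the fuzzy threshold of type m is theta_m.
        return 1.0 / (1.0 + np.exp(B - theta_m))

    thetas = np.array([4.0, 7.0, 10.0])      # hypothetical thresholds for three call types
    for B in (2.0, 6.0, 12.0):               # hypothetical levels of used bandwidth
        print("B =", B, "->", np.round(accept_probability(B, thetas), 3))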

7.1 Idealized Gradient Algorithm

For this case, we were able to compute the exact gradient $\nabla\lambda(\theta)$ and to implement the idealized gradient algorithm of Section 3. The corresponding trajectories are given in Figure 1. The resulting average reward is . (the optimal average reward obtained by dynamic programming is .6).

Figure 2: Trajectory of the average reward (top) and threshold parameters (bottom) of the simulation-based algorithm of Section 3 (the scaling factor for the iteration steps is $10^6$).

7.2 Simulation-Based Optimization

Figures 2, 3, and 4 show the results for the versions of the simulation-based algorithms of Section 3, Section 5 (for the choice of $S_0$ see [Mar]), and Section 6 (with a fixed discount factor $\alpha$), respectively, where the parameter $\theta$ gets updated at every time step. We make the following observations.

Figure 3: Trajectory of the average reward (top) and threshold parameters (bottom) of the simulation-based algorithm of Section 5 (the scaling factor for the iteration steps is $10^5$).

Figure 4: Trajectory of the average reward (top) and threshold parameters (bottom) of the simulation-based algorithm of Section 6 (the scaling factor for the iteration steps is $10^5$).

1. All three algorithms make rapid progress in the beginning, improving the average reward from . to . within the first $10^6$ iteration steps for the algorithm of Section 3 and within the first $10^5$ iteration steps for the algorithms of Sections 5 and 6. After reaching an average reward of ., the algorithms progress only slowly. This is not unlike the behavior of the idealized algorithm (see Figure 1).

2. The simulation-based algorithm of Section 3 attains an average reward of . after $10^6$ iteration steps, while the algorithms of Sections 5 and 6 obtain an average reward of .2 and .5, respectively, after $10^6$ iteration steps.

These results are encouraging, as the algorithms with reduced variance speed up the convergence by an order of magnitude, while introducing a negligible bias.

References

[Ber5a] D. P. Bertsekas. Dynamic Programming and Optimal Control, Vols. I and II. Athena Scientific, Belmont, MA, 1995.
[Ber5b] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
[CC] X. R. Cao and H. F. Chen. Perturbation Realization, Potentials, and Sensitivity Analysis of Markov Processes. IEEE Transactions on Automatic Control, 42:32–33, 1997.
[CR4] E. K. P. Chong and P. J. Ramadge. Stochastic Optimization of Regenerative Systems Using Infinitesimal Perturbation Analysis. IEEE Transactions on Automatic Control, 3:4–4, 1994.
[CW] X. R. Cao and Y. W. Wan. Algorithms for Sensitivity Analysis of Markov Systems through Potentials and Perturbation Realization. IEEE Transactions on Control Systems Technology, 6:42–44, 1998.
[FH4] M. C. Fu and J.-Q. Hu. Smoothed Perturbation Analysis Derivative Estimation for Markov Chains. Operations Research Letters, 5:24–25, 1994.
[FH] M. Fu and J.-Q. Hu. Conditional Monte Carlo: Gradient Estimation and Optimization Applications. Kluwer Academic Publishers, Boston, MA, 1997.
[Gal5] R. G. Gallager. Discrete Stochastic Processes. Kluwer Academic Publishers, Boston/Dordrecht/London, 1995.
[Gly6] P. W. Glynn. Stochastic Approximation for Monte Carlo Optimization. Proceedings of the 1986 Winter Simulation Conference, pages 25–2, 1986.
[Gly] P. W. Glynn. Likelihood Ratio Gradient Estimation: An Overview. Proceedings of the Winter Simulation Conference, pages 366–35, 1987.
[JSJ5] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems. Advances in Neural Information Processing Systems, pages 35–46, 1995.
[Mar] P. Marbach. Simulation-Based Optimization of Markov Decision Processes. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 1998.
[MT] P. Marbach and J. N. Tsitsiklis. Simulation-Based Optimization of Markov Reward Processes. Technical Report LIDS-P 24, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, 1998.
