POMDPs and Policy Gradients

POMDPs and Policy Gradients
MLSS 2006, Canberra
Douglas Aberdeen
Canberra Node, RSISE Building, Australian National University
15th February 2006

Outline
1. Introduction: What is Reinforcement Learning? Types of RL
2. Value Methods: Model Based
3. Partial Observability
4. Policy-Gradient Methods: Model Based, Experience Based

Reinforcement Learning (RL) in a Nutshell
- RL can learn any function
- RL inherently handles uncertainty: uncertainty in actions (the world) and uncertainty in observations (sensors)
- RL directly maximises the criteria we care about
- RL copes with delayed feedback: the temporal credit assignment problem


Examples
- Backgammon: TD-Gammon [12]. Beat the world champion in individual games, and can learn things no human ever thought of; TD-Gammon opening moves are now used by the best humans.
- Chess: Australian Computer Chess Champion [4], playing against an Australian Champion chess player. RL learns the evaluation function at the leaves of min-max search.
- Elevator Scheduling [6] (Crites and Barto, 1996): optimally dispatch multiple elevators to calls. Not implemented in practice as far as I know.

Partially Observable Markov Decision Processes
[Figure: the agent-environment loop of a POMDP. The world has hidden state s with transitions Pr[s'|s,a] and reward r(s); the agent receives observations o with probability Pr[o|s] and chooses actions a ~ Pr[a|o,w] using parameters w. Partial observability distinguishes the POMDP from the MDP, where o = s.]

Types of RL
[Figure: a grid of approaches, with Value versus Policy methods on one axis and Model Based (dynamic programming) versus Experience (RL) on the other, for both MDPs and POMDPs. The "Progress" slides later fill in this grid.]

Optimality Criteria
The value V(s) is a long-term reward from state s. How do we measure long-term reward?
Sum of rewards: V(s) = E_w[ Σ_{t=0}^∞ r(s_t) | s_0 = s ], which is ill-conditioned from the decision-making point of view.
Sum of discounted rewards: V(s) = E_w[ Σ_{t=0}^∞ γ^t r(s_t) | s_0 = s ]
Finite horizon: V_T(s) = E_w[ Σ_{t=0}^{T-1} r(s_t) | s_0 = s ]

Criteria Continued
Baseline reward: V_B(s) = E_w[ Σ_{t=0}^∞ (r(s_t) − r̄) | s_0 = s ], where r̄ is an estimate of the long-term average reward.
The long-term average reward is intuitively appealing:
V(s) = lim_{T→∞} (1/T) E_w[ Σ_{t=0}^{T-1} r(s_t) | s_0 = s ]

Discounted or Average?
Ergodic MDP:
- Positive recurrent: finite return times
- Irreducible: a single recurrent set of states
- Aperiodic: the GCD of return times is 1
If the Markov system is ergodic then V(s) = η for all s, i.e., η is constant over s.
Conversion from discounted to long-term average: η = (1 − γ) E_s[V(s)].
We focus on the discounted V(s) for value methods.

Average versus Discounted
[Figure: a six-state cycle 1 → 2 → ... → 6 → 1 with r(s) = s. Under the long-term average criterion every state has the same value, V(1) = V(4) = 3.5. Under discounting with factor 0.8 the values differ: V(1) = 14.3 and V(4) = 19.2.]
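A minimal numerical check of this example (a sketch only; it assumes the cycle steps 1 → 2 → ... → 6 → 1 with r(s) = s and discount 0.8, as read off the figure):

```python
import numpy as np

n = 6
gamma = 0.8
r = np.arange(1, n + 1, dtype=float)       # r(s) = s
P = np.roll(np.eye(n), 1, axis=1)          # deterministic cycle s -> s+1

# Discounted values: V = (I - gamma P)^{-1} r
V = np.linalg.solve(np.eye(n) - gamma * P, r)
print(V[0], V[3])                          # ~14.3 and ~19.2

# Long-term average reward: eta = pi^T r with the uniform stationary distribution
pi = np.full(n, 1.0 / n)
print(pi @ r)                              # 3.5, the same for every start state

# Conversion from the previous slide: eta = (1 - gamma) E_s[V(s)]
print((1 - gamma) * (pi @ V))              # also 3.5
```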

Dynamic Programming
How do we compute V(s) for a fixed policy? Find the fixed point V(s) that solves Bellman's equation:
V(s) = r(s) + γ Σ_{s'∈S} Σ_{a∈A} Pr[s'|s,a] Pr[a|s,w] V(s')
In matrix form, with vectors V and r, define the stochastic transition matrix for the current policy:
P = Σ_{a∈A} Pr[s'|s,a] Pr[a|s,w]
Now V = r + γPV. This is like shortest-path algorithms, or Viterbi estimation.

Analytic Solution
V = r + γPV
V − γPV = r
(I − γP)V = r
V = (I − γP)^{-1} r
This is just solving Ax = b. It computes V(s) for a fixed policy (fixed w). There is no solution unless γ ∈ [0, 1). The O(|S|^3) solution is not feasible for large state spaces.
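A sketch of policy evaluation along these lines (both the analytic solve and repeated Bellman backups), assuming per-action transition matrices P_a, a stochastic policy pi_w and a reward vector r are given as NumPy arrays; all names here are illustrative:

```python
import numpy as np

def policy_transition_matrix(P_a, pi_w):
    """P[s, s'] = sum_a Pr[s'|s,a] Pr[a|s,w], with P_a of shape [A,S,S] and pi_w of shape [S,A]."""
    return np.einsum('asx,sa->sx', P_a, pi_w)

def evaluate_analytic(P_a, pi_w, r, gamma):
    """V = (I - gamma P)^{-1} r  --  O(|S|^3), fine only for small |S|."""
    P = policy_transition_matrix(P_a, pi_w)
    return np.linalg.solve(np.eye(len(r)) - gamma * P, r)

def evaluate_iterative(P_a, pi_w, r, gamma, sweeps=1000):
    """Repeated Bellman backups V <- r + gamma P V (dynamic programming)."""
    P = policy_transition_matrix(P_a, pi_w)
    V = np.zeros_like(r, dtype=float)
    for _ in range(sweeps):
        V = r + gamma * P @ V
    return V
```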

Progress...
[Figure: the Value/Policy versus Model Based/Experience grid. Value and policy iteration fill the model-based MDP cell; TD, SARSA and Q-learning fill the experience-based MDP cell. The POMDP column is still empty.]

Partial Observability
We have assumed so far that o = s, i.e., full observability. What if s is obscured? The Markov assumption is violated! Options:
- Ostrich approach: ignore the problem (SARSA works well in practice)
- Exact methods
- Direct policy search: bypass values; local convergence
The best policy may need the full history: Pr[a_t | o_t, a_{t−1}, o_{t−1}, ..., a_1, o_1]

Belief States
Belief states sufficiently summarise the history:
b(s) = Pr[s | o_t, a_{t−1}, o_{t−1}, ..., a_1, o_1]
i.e., the probability of each world state computed from the history. Given the belief b_t at time t, we can predict the belief after the next action a_t:
b̃_{t+1}(s') = Σ_{s∈S} b_t(s) Pr[s'|s, a_t]
Now incorporate the observation o_{t+1} as evidence for state s':
b_{t+1}(s') = b̃_{t+1}(s') Pr[o_{t+1}|s'] / Σ_{s''∈S} b̃_{t+1}(s'') Pr[o_{t+1}|s'']
This is like HMM forward estimation. Just updating the belief state is O(|S|^2).
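A minimal belief-update sketch, assuming per-action transition matrices T[a][s, s'] = Pr[s'|s,a] and an observation matrix O[s, o] = Pr[o|s] (illustrative names, not from the slides):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    predicted = b @ T[a]                 # b~_{t+1}(s') = sum_s b_t(s) Pr[s'|s,a]
    corrected = predicted * O[:, o]      # weight by the evidence Pr[o_{t+1}|s']
    return corrected / corrected.sum()   # renormalise to a distribution over states
```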

Value Iteration for Belief States
Do normal VI, but replace states with belief states b:
V(b) = r(b) + γ Σ_{b'} Σ_a Pr[b'|b,a] Pr[a|b,w] V(b')
Expanding out the terms involving b:
V(b) = Σ_{s∈S} b(s) r(s) + γ Σ_{a∈A} Σ_{o∈O} Σ_{s∈S} Σ_{s'∈S} Pr[s'|s,a] Pr[o|s'] Pr[a|b,w] b(s) V(b^{(a,o)})
where b^{(a,o)} is the belief reached from b after action a and observation o. What does V(b) look like? It is piecewise linear and convex:
V(b) = max_{l∈L} l·b
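Evaluating such a piecewise-linear value function is just a maximisation over hyperplanes; a small sketch, assuming L is a list of hyperplane vectors with one entry per state:

```python
import numpy as np

def value(b, L):
    """V(b) = max_{l in L} l . b"""
    return max(np.dot(l, b) for l in L)

def best_hyperplane(b, L):
    """Index of the maximising hyperplane; its associated action is the one to execute."""
    return int(np.argmax([np.dot(l, b) for l in L]))
```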

Piecewise Linear Representation
[Figure: V(b) over the belief state space (the simplex with b_0 + b_1 = 1) drawn as the upper surface of hyperplanes l0, ..., l4. Several hyperplanes can share a common action u; one hyperplane is dominated everywhere ("useless") and contributes nothing to V(b).]

Policy-Graph Representation
[Figure: the same hyperplanes l0, ..., l4 over the belief space, now viewed as nodes of a policy graph: each hyperplane is associated with an action (a = 1, 2, 3, ...) and observations (observation 1, observation 2) label the transitions between hyperplanes.]

Complexity
High-level value iteration for POMDPs:
1. Initialise b_0 (uniform, or a known start state)
2. Receive observation o
3. Update the belief state b
4. Find the maximising hyperplane l for b
5. Choose action a
6. Generate a new l for each observation and future action
7. While not converged, go to 2
The specifics generate lots of algorithms. The number of hyperplanes grows exponentially: the problem is PSPACE-hard, and infinite-horizon problems might need infinitely many hyperplanes.

Approximate Value Methods for POMDPs
Approximations usually learn the value of representative belief states and interpolate to new belief states. The corners of the belief simplex are representative states.
Most Likely State (MLS) heuristic: Q(b, a) = Q(s*, a), where s* = argmax_s b(s)
Q_MDP assumes the true state is known after one more step: Q(b, a) = Σ_{s∈S} b(s) Q(s, a)
Grid methods distribute many belief states uniformly [5].
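Small sketches of the MLS and Q_MDP heuristics, assuming Q[s, a] is a tabular Q-function for the underlying MDP and b is a belief vector over states (illustrative names):

```python
import numpy as np

def q_mls(b, Q):
    """Most Likely State: act as if the most probable state were the true state."""
    return Q[np.argmax(b)]     # vector of action values Q(b, .)

def q_mdp(b, Q):
    """Q_MDP: belief-weighted action values, Q(b, a) = sum_s b(s) Q(s, a)."""
    return b @ Q

# Greedy action under either heuristic, e.g. int(np.argmax(q_mdp(b, Q)))
```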

Progress...
[Figure: the grid again. Exact VI is added to the model-based POMDP cell; "SARSA?" marks the experience-based POMDP cell (the ostrich approach). The policy column is still empty.]

Policy-Gradient Methods
We all know what gradient ascent is. A value-gradient method is TD with function approximation. Policy-gradient methods instead learn the policy directly, by estimating the gradient of a long-term reward measure with respect to the parameters w that describe the policy.
Are there non-gradient direct policy methods? Yes:
- Search in policy space [10]
- Evolutionary algorithms [8]
For these slides we give up the idea of belief states and work directly with observations o, i.e., Pr[a|o,w].

Why Policy-Gradient? Pros
- No divergence, even under function approximation
- Occam's Razor: policies are often much simpler to represent. Consider using a neural network to estimate a value, compared to simply choosing an action: are we trying to learn Q(0, left) = 0.255 and Q(0, right) = 0.25, or just Q(0, left) > Q(0, right)?
- Partial observability does not hurt convergence (but of course, the best attainable long-term value might drop)
- Complexity independent of |S|

Why Not Policy-Gradient? Cons
- Convergence to the globally optimal policy is lost
- The Bellman constraint is lost, so the variance is larger
- Sometimes the values themselves carry meaning

Long-Term Average Reward
Recall the long-term average reward:
V(s) = lim_{T→∞} (1/T) E_w[ Σ_{t=0}^{T-1} r(s_t) | s_0 = s ]
and that if the Markov system is ergodic then V(s) = η for all s.
We now assume a function approximation setting. We want to maximise η(w) by computing its gradient
∇η(w) = [ ∂η/∂w_1, ..., ∂η/∂w_P ]
and stepping the parameters in that direction, for example (there are better ways to do it):
w_{t+1} = w_t + α ∇η(w)

Computing the Gradient
Recall the reward column vector r. An ergodic system has a unique stationary distribution over states, π(w), so
η(w) = π(w)^T r
Recall that the state transition matrix under the current policy is
P(w) = Σ_{a∈A} Pr[s'|s,a] Pr[a|s,w]
and the stationary distribution satisfies π(w)^T = π(w)^T P(w).

Computing the Gradient (cont.)
We drop the explicit dependencies on w. Let e be a column vector of 1's.
The gradient of the long-term average reward:
∇η = π^T (∇P)(I − P + eπ^T)^{-1} r
Exercise: derive this expression using
1. η = π^T r and π^T = π^T P
2. Start with ∇η = (∇π)^T r, and (∇π)^T = (∇π)^T P + π^T (∇P)
3. (I − P) is not invertible, but (I − P + eπ^T) is
4. (∇π)^T e = 0

Solution
Start from ∇η = (∇π)^T r and differentiate π^T = π^T P:
(∇π)^T = (∇π)^T P + π^T (∇P)
(∇π)^T (I − P) = π^T (∇P)
Now (I − P) is not invertible, but (I − P + eπ^T) is. Also, (∇π)^T eπ^T = 0, so without changing the solution
(∇π)^T (I − P + eπ^T) = π^T (∇P)
(∇π)^T = π^T (∇P)(I − P + eπ^T)^{-1}
∇η = π^T (∇P)(I − P + eπ^T)^{-1} r

Using ∇η
If we know P and r we can compute ∇η exactly for small P; π is the leading left eigenvector of P. If P is sparse this works well: Gradient Ascent of Modelled POMDPs (GAMP) [1] found the optimum policy for a system with 26,000 states in 30 s. If the state space is infinite, or just large, it becomes infeasible. This expression is the basis for our experience-based algorithm.
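A small-state-space sketch of this exact computation (the idea behind GAMP), assuming P(w) and its per-parameter derivatives ∂P/∂w_k are available as dense matrices; the function names are illustrative:

```python
import numpy as np

def stationary_distribution(P):
    """pi with pi^T = pi^T P: the leading left eigenvector of P, normalised to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    return pi / pi.sum()

def grad_eta(P, dP_list, r):
    """Exact gradient: for each parameter k, pi^T (dP/dw_k) (I - P + e pi^T)^{-1} r."""
    n = P.shape[0]
    pi = stationary_distribution(P)
    # (I - P) is singular, but (I - P + e pi^T) is invertible
    A_inv_r = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), r)
    return np.array([pi @ dP @ A_inv_r for dP in dP_list])
```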

Progress...
[Figure: the grid again, with GAMP added to the model-based policy cells for both MDPs and POMDPs.]

Experience-Based Policy Gradient
Problem: no model P? Too many states? Answer: compute a Monte Carlo estimate of the gradient ∇η:
∇η = lim_{β→1} lim_{T→∞} (1/T) Σ_{t=0}^{T−1} ( ∇Pr[s_{t+1}|s_t, a_t] / Pr[s_{t+1}|s_t, a_t] ) Σ_{τ=t+1}^{T} β^{τ−t−1} r(s_τ)
This is derived by applying the ergodic theorem to an approximation of the true gradient [3]:
∇η = lim_{β→1} π^T (∇P) V_β, where V_β(s) = E_w[ Σ_{t=0}^∞ β^t r(s_t) | s_0 = s ]

GPOMDP(w) (Gradient POMDP)
1. Initialise the gradient estimate ∆ = 0, the eligibility trace e = 0, and t = 0; choose the number of steps T
2. Initialise the world randomly
3. Get observation o from the world
4. Choose an action a ~ Pr[·|o, w]
5. Do action a
6. Receive reward r
7. e ← βe + ∇Pr[a|o,w] / Pr[a|o,w]
8. ∆ ← ∆ + (1/(t+1)) (r e − ∆)
9. t ← t + 1
10. While t < T, go to 3
On termination, ∆ is the estimate of ∇η.
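A minimal GPOMDP sketch, assuming a softmax policy with one weight per (observation, action) pair and an environment object with reset()/step() methods returning integer observations and scalar rewards; this interface and all names are illustrative, not from the slides:

```python
import numpy as np

def softmax_policy(w, o):
    """Pr[.|o, w] for a softmax parameterisation; w has shape [num_obs, num_actions]."""
    p = np.exp(w[o] - w[o].max())
    return p / p.sum()

def gpomdp_gradient(env, w, beta=0.8, T=5000, rng=np.random.default_rng(0)):
    grad = np.zeros_like(w)              # running gradient estimate (step 8)
    e = np.zeros_like(w)                 # eligibility trace (step 7)
    o = env.reset()                      # hypothetical environment interface
    for t in range(T):
        probs = softmax_policy(w, o)
        a = rng.choice(len(probs), p=probs)
        # grad log Pr[a|o,w] = grad Pr[a|o,w] / Pr[a|o,w] for the softmax policy
        score = np.zeros_like(w)
        score[o] = -probs
        score[o, a] += 1.0
        o, r = env.step(a)               # next observation and reward
        e = beta * e + score             # step 7
        grad += (r * e - grad) / (t + 1) # step 8
    return grad
```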

Bias-Variance Trade-Off in Policy Gradient
The parameter β ensures the estimate has finite variance, scaling as
var(∆) ∝ 1 / (T(1 − β))
so β ∈ [0, 1). But as β decreases, the bias increases.
T should be at least the mixing time of the Markov process, i.e., how long it would take to get a good estimate of π. This is hard to compute in general.
Rule of thumb for T: make T as large as possible.
Rule of thumb for β: increase β until the gradient estimates become inconsistent.

Load/Unload Demonstration
[Figure: six locations in a row labelled U (unload), N, N, N, N, L (load); the reward shown is 1 at the end locations and 0 in between.]
The agent must go right to get a load, then go left to drop it. The optimal policy is: go left if loaded, right otherwise. A reactive (memoryless) policy is not sufficient, because partial observability means the agent cannot detect whether it is loaded.

Results
Algorithm     mean η   max. η   var.    Time (s)
GAMP          2.39     2.50     0.116   0.22
GPOMDP        1.15     2.50     0.786   2.05
Inc. Prune.   2.50     2.50     0       3.27
Optimum       2.50
Averages over 100 training and testing runs. GPOMDP used β = 0.8 and T = 5000. Incremental Pruning is an exact POMDP value method.

Natural Actor-Critic
The current method of choice: it combines the scalability of policy-gradient methods with the low variance of value methods. Two ideas:
1. Actor-Critic: the actor is a policy-gradient learner; the critic learns a projection of the value function, and the critic's value estimate improves the actor's learning.
2. Natural gradient: use Amari's natural gradient to accelerate convergence. Keep an estimate of the inverse of the Fisher information matrix; NAC shows how to do this efficiently.
Jan Peters, Sethu Vijayakumar, and Stefan Schaal (2005). Natural Actor-Critic. In Proceedings of the 16th European Conference on Machine Learning (ECML 2005).
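A rough sketch of the natural-gradient step only, estimating the Fisher matrix from sampled score vectors ∇log Pr[a|o,w] and inverting it explicitly; NAC itself does this far more efficiently, so this is just to illustrate the idea (all names are assumptions):

```python
import numpy as np

def natural_gradient_step(w, grad_eta, scores, alpha=0.1, ridge=1e-3):
    """One natural-gradient update w <- w + alpha F^{-1} grad_eta.

    scores: list of arrays shaped like w, each a sampled grad log Pr[a|o,w].
    """
    S = np.stack([s.ravel() for s in scores])
    F = S.T @ S / len(scores) + ridge * np.eye(S.shape[1])  # Fisher estimate E[score score^T]
    step = np.linalg.solve(F, grad_eta.ravel())             # F^{-1} grad eta
    return w + alpha * step.reshape(w.shape)
```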

Progress...
[Figure: the completed grid. GAMP (model-based) and GPOMDP (experience-based) fill the policy cells for both MDPs and POMDPs, alongside exact VI, value and policy iteration, TD, SARSA and Q-learning.]

The End
Reinforcement learning is good when:
- performance feedback is unspecific, delayed, or unpredictable
- you are trying to optimise a non-linear feedback system
Reinforcement learning is bad because:
- it is very slow to learn in large environments with weak rewards
- if you don't have an appropriate reward, what are you learning?
Areas we have not considered:
- How can we factorise state spaces and value functions? [9]
- What happens to exact POMDP methods when |S| is infinite? [13]
- Taking advantage of history in direct policy methods [2]
- How can we reduce variance in all methods?
- Combining experience-based methods with DP methods [7]

References
[1] Douglas Aberdeen and Jonathan Baxter. Internal-state policy-gradient algorithms for infinite-horizon POMDPs. Technical report, RSISE, Australian National University, 2002. http://discus.anu.edu.au/~daa/papers.html
[2] Douglas A. Aberdeen. Policy-Gradient Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Australian National University, March 2003.
[3] Jonathan Baxter and Peter L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.
[4] Jonathan Baxter, Andrew Tridgell, and Lex Weaver. KnightCap: A chess program that learns by combining TD(λ) with game-tree search. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 28-36. Morgan Kaufmann, 1998.
[5] Blai Bonet. An ε-optimal grid-based algorithm for partially observable Markov decision processes. In 19th International Conference on Machine Learning, Sydney, Australia, June 2002.
[6] Robert H. Crites and Andrew G. Barto. Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3):235-262, 1998.
[7] Héctor Geffner and Blai Bonet. Solving large POMDPs by real time dynamic programming. Working Notes, Fall AAAI Symposium on POMDPs, 1998. http://www.cs.ucla.edu/~bonet/

[8] Matthew R. Glickman and Katia Sycara. Evolutionary search, stochastic policies with memory, and reinforcement learning with hidden state. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 194-201. Morgan Kaufmann, June 2001.
[9] Carlos Guestrin, Daphne Koller, and Ronald Parr. Solving factored POMDPs with linear value functions. In IJCAI-01 Workshop on Planning under Uncertainty and Incomplete Information, Seattle, Washington, August 2001.
[10] Nicolas Meuleau, Kee-Eung Kim, Leslie Pack Kaelbling, and Anthony R. Cassandra. Solving POMDPs by searching the space of finite policies. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 127-136. Morgan Kaufmann, July 1999.
[11] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. ISBN 0-262-19398-1.
[12] Gerald Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6:215-219, 1994.
[13] Sebastian Thrun. Monte Carlo POMDPs. In Advances in Neural Information Processing Systems 12. MIT Press, 2000. http://citeseer.nj.nec.com/thrun99monte.html