State Space Abstraction for Reinforcement Learning
Rowan McAllister & Thang Bui, MLG, Cambridge, Nov 6th, 2014
Rowan's introduction
Types of abstraction [LWL06]
Abstractions partition the state space based on the state-action value functions $Q^\pi(s, a)$ and $Q^*(s, a)$, and are loosely categorised into five types.
Model-irrelevance abstraction $\phi_{model}$
Definition: $\phi_{model}(s_1) = \phi_{model}(s_2)$ implies, for all actions $a$ and abstract states $x$,
$$R^a_{s_1} = R^a_{s_2} \quad \text{and} \quad \sum_{s' \in \phi_{model}^{-1}(x)} P^a_{s_1 s'} = \sum_{s' \in \phi_{model}^{-1}(x)} P^a_{s_2 s'}.$$
In words, for any action a, ground states in the same abstract class should have the same reward, and have the same transition probability into each new abstract class.
[Figure: grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values)]
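To make the condition concrete, here is a minimal sketch (Python/numpy, with a hypothetical tabular MDP encoding not taken from the talk) that tests whether a candidate partition satisfies the $\phi_{model}$ condition:

    import numpy as np

    def is_model_irrelevant(phi, R, P, tol=1e-9):
        """Check the phi_model condition for a tabular MDP.

        phi: length-n array, phi[s] = abstract class of ground state s
        R:   rewards, shape (n_states, n_actions)
        P:   transitions, shape (n_actions, n_states, n_states)
        """
        classes = np.unique(phi)
        for x in classes:
            members = np.flatnonzero(phi == x)
            s1 = members[0]
            for s2 in members[1:]:
                # Same reward for every action: R^a_{s1} = R^a_{s2}.
                if not np.allclose(R[s1], R[s2], atol=tol):
                    return False
                # Same total transition mass into every abstract class.
                for y in classes:
                    mask = phi == y
                    if not np.allclose(P[:, s1, mask].sum(axis=1),
                                       P[:, s2, mask].sum(axis=1), atol=tol):
                        return False
        return True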
Types of abstraction [cont'd]
Definition ($Q^\pi$-irrelevance abstraction): $\phi_{Q^\pi}(s_1) = \phi_{Q^\pi}(s_2) \Rightarrow Q^\pi(s_1, a) = Q^\pi(s_2, a), \; \forall \pi, a$
In words, for any policy π and action a, ground states in the same abstract class should have the same state-action value.
[Figure: grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values)]
Types of abstraction [cont'd]
Definition ($Q^*$-irrelevance abstraction): $\phi_{Q^*}(s_1) = \phi_{Q^*}(s_2) \Rightarrow Q^*(s_1, a) = Q^*(s_2, a), \; \forall a$
In words, for any action a, ground states in the same abstract class should have the same optimal state-action value function, and hence the same optimal policy.
[Figure: grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values)]
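For a small MDP whose model is known, such a partition could be computed directly; a sketch (same hypothetical array conventions as the earlier check) that runs value iteration for $Q^*$ and then groups states with matching $Q^*$ rows:

    import numpy as np

    def q_star(R, P, gamma=0.9, iters=1000):
        """Value iteration for Q*; R is (n_states, n_actions),
        P is (n_actions, n_states, n_states)."""
        Q = np.zeros_like(R)
        for _ in range(iters):
            V = Q.max(axis=1)
            Q = R + gamma * np.einsum('ast,t->sa', P, V)
        return Q

    def phi_q_star(Q, decimals=6):
        """phi_{Q*}: group ground states whose Q* rows agree (up to rounding)."""
        labels = {}
        keys = [tuple(np.round(row, decimals)) for row in Q]
        return np.array([labels.setdefault(k, len(labels)) for k in keys])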
Types of abstraction [cont'd]
Definition ($a^*$-irrelevance abstraction): $\phi_{a^*}(s_1) = \phi_{a^*}(s_2) \Rightarrow \exists a^*$ optimal for both $s_1$ and $s_2$, with $Q^*(s_1, a^*) = \max_a Q^*(s_1, a) = \max_a Q^*(s_2, a) = Q^*(s_2, a^*)$
In words, ground states in the same abstract class should share an optimal action, and have the same optimal state-action value upon taking that optimal action.
[Figure: grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values)]
And even more abstract...
Definition ($\pi^*$-irrelevance abstraction): $\phi_{\pi^*}(s_1) = \phi_{\pi^*}(s_2) \Rightarrow \exists a^*$ that is optimal for both $s_1$ and $s_2$
In words, ground states in the same abstract class should share an optimal action.
[Figure: grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values)]
Types of abstraction - An example
[Figure: (a) an example ground MDP and (b) an abstraction of it, with transition probabilities 0.5]
Types of abstraction - A comparison
Questions: Is the optimal policy the same in the abstract space and the ground space? Are the optimal state-action value functions the same?

  Type      Same optimal policy   Q-learning optimality
  φ_model   yes                   yes
  φ_Q^π     yes                   yes
  φ_Q*      yes                   yes
  φ_a*      yes                   no
  φ_π*      no                    no

[Figure: counterexample MDP (b), with transition probabilities 0.5]
Predictive State Representation: An introduction
A Zoubin cube for HMM/POMDP/PSR...
Predictive State Representation - Overview
A model for controlled dynamical systems.
Why useful?
- A compact representation compared to k-Markov or POMDP models
- Speedy learning
A review of POMDPs
A POMDP is defined by a tuple $\langle S, A, O, T, O, b_0 \rangle$:
- $S$, $A$, $O$: sets of hidden states, actions and observations, with $|S| = n$
- $T$: set of $n \times n$ transition matrices, one for each action $a \in A$
- $O$: set of $n \times n$ diagonal observation matrices, one for each action-observation pair
- $b_0$: initial belief state, $\{p(s_i)\}_{i=1}^n$
POMDPs use a belief state to encode historical information. The belief state at history $h$ is $b(s|h) = \{p(s_i|h)\}_{i=1}^n$. After taking an action $a$ and observing $o$, the belief state can be updated:
$$b(s|hao) = \frac{b(s|h) \, T^a O^{a,o}}{b(s|h) \, T^a O^{a,o} \, 1_n}$$
PSRs remove hidden states and instead use probabilities of particular future action-observation sequences to encode historical information.
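A one-line numpy sketch of this update, assuming the belief is kept as a row vector and the matrices follow the conventions above:

    import numpy as np

    def belief_update(b, T_a, O_ao):
        """POMDP belief update after taking action a and observing o.

        b:    belief over hidden states, shape (n,)
        T_a:  n x n transition matrix for action a
        O_ao: n x n diagonal observation matrix for the pair (a, o)
        """
        unnorm = b @ T_a @ O_ao           # numerator: b(s|h) T^a O^{a,o}
        return unnorm / unnorm.sum()      # denominator: b(s|h) T^a O^{a,o} 1_n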
Predictive State Representation - Notations
A test (or experiment) $t$ is a sequence of actions and observations: $t = a_1 o_1 a_2 o_2 \cdots a_k o_k$
An observed history $h$ of length $n$: $h = a_1 o_1 a_2 o_2 \cdots a_n o_n$
Test prediction given history:
$$p(t|h) = \mathrm{Pr}(o_{n+1} = o_1, \ldots, o_{n+k} = o_k \mid h, a_{n+1} = a_1, \ldots, a_{n+k} = a_k)$$
A set of $m$ core tests $T = \{t_1, t_2, \ldots, t_m\}$ has prediction vector $p(T|h) = [p(t_1|h), p(t_2|h), \ldots, p(t_m|h)]$.
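When a POMDP model of the system is available, a test prediction can be computed by chaining the matrices from the POMDP review above; a small sketch (the dict-based model encoding is an assumption for illustration):

    import numpy as np

    def test_prediction(b_h, test, T, O):
        """p(t|h) for a test t = a1 o1 ... ak ok, from a POMDP model.

        b_h:  belief state at history h, shape (n,)
        test: sequence of (action, observation) pairs
        T:    dict mapping action -> n x n transition matrix
        O:    dict mapping (action, observation) -> n x n diagonal matrix
        """
        v = np.asarray(b_h, dtype=float)
        for a, o in test:
            v = v @ T[a] @ O[(a, o)]
        return v.sum()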
Predictive State Representation - Definition
Definition ([LSS02]): A set $T$ is a PSR iff its prediction vector forms a sufficient statistic for the dynamical system, i.e. $p(t|h) = f_t(p(T|h))$ for any test $t$ and history $h$.
The mapping $f_t(\cdot)$ could be linear or non-linear. A linear $f_t(\cdot)$ means that for every test $t$ there exists a weight vector $w_t$ such that $p(t|h) = p(T|h)^\top w_t$.
Updating $p(T|h)$ upon taking action $a$ and observing $o$:
$$p(t_i|hao) = \frac{p(a o t_i|h)}{p(a o|h)} = \frac{f_{aot_i}(p(T|h))}{f_{ao}(p(T|h))} = \frac{p(T|h)^\top w_{aot_i}}{p(T|h)^\top w_{ao}}$$
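In the linear case the update above is just two inner products; a minimal sketch, assuming the weight vectors of the extended tests are stacked into a matrix:

    import numpy as np

    def psr_update(p_T, W_aot, w_ao):
        """Update the prediction vector p(T|h) -> p(T|hao).

        p_T:   current prediction vector p(T|h), shape (m,)
        W_aot: matrix whose i-th column is w_{a o t_i}, shape (m, m)
        w_ao:  weight vector of the one-step test ao, shape (m,)
        """
        return (p_T @ W_aot) / (p_T @ w_ao)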
Predictive State Representation - Why use PSR?
Theorem ([LSS02]): For any environment that can be represented by a finite POMDP model, there exists a linear PSR whose number of tests $m$ is no larger than the number of states in the minimal POMDP model.
Example: a POMDP with 5 states admits a linear PSR with $T = \{r, fr, ffr, fffr, ffffr\}$ and a nonlinear PSR with $T = \{r, fr\}$.
Predictive State Representation - The system dynamics matrix
The system dynamics matrix $D$ specifies the prediction of every test given every history:

          t_1          ...   t_i          ...
  h_0     p(t_1|h_0)   ...   p(t_i|h_0)   ...
  h_1     p(t_1|h_1)   ...   p(t_i|h_1)   ...
  ...
  h_j     p(t_1|h_j)   ...   p(t_i|h_j)   ...

$D$ has an infinite number of columns and an infinite number of rows!
Let $T$ be a set of linearly independent columns of $D$; then $T$ is a linear PSR, as $D(:, t) = D(:, T) w_t$, i.e. $p(t|h) = p(T|h) w_t$.
Define $H$ to be a set of linearly independent rows of $D$; then $H$ forms a set of core histories.
Predictive State Representation - How to discover T [JS04]
Data: matrix $D$. Result: $T = \{t_1, t_2, \ldots, t_m\}$.
Initialisation: $v_0 = 1$; $T_0 = \{\varepsilon\}$; $H_0 = \{\varepsilon\}$; $i = 1$
while true do
    form the one-step extension of $D_{H,T}$:
        $D_{H',T'} = \begin{bmatrix} D_{H,T} & D_{H,T^+} \\ D_{H^+,T} & D_{H^+,T^+} \end{bmatrix}$
    $T_i \leftarrow$ linearly independent columns of $D_{H',T'}$
    $H_i \leftarrow$ linearly independent rows of $D_{H',T'}$
    $v_i \leftarrow$ rank of $D_{H',T'}$
    if $v_i == v_{i-1}$ then stop, else $i \leftarrow i + 1$
end
Caveats: $D$ cannot be computed exactly in general, hence sampling is needed. $T$ may not be complete, as the stopping condition $v_i == v_{i-1}$ is too strict.
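A sketch of this search in Python, under the idealised assumption that entries of $D$ can be queried exactly; `D_entry` and `extend` are hypothetical callables standing in for the model oracle and the one-step extension operation:

    import numpy as np

    def independent_indices(M, axis=0, tol=1e-7):
        """Greedily pick a maximal set of linearly independent
        rows (axis=0) or columns (axis=1) of M."""
        M = M if axis == 0 else M.T
        idx, rank = [], 0
        for i in range(M.shape[0]):
            if np.linalg.matrix_rank(M[idx + [i], :], tol=tol) > rank:
                idx.append(i)
                rank += 1
        return idx

    def discover_core_tests(D_entry, extend, max_iters=20):
        """Grow histories/tests one step at a time until the rank of
        the examined block of D stops increasing [JS04-style]."""
        H, T, v_prev = [''], [''], -1   # '' is the empty history/test
        for _ in range(max_iters):
            H_ext = H + [e for h in H for e in extend(h)]
            T_ext = T + [e for t in T for e in extend(t)]
            D = np.array([[D_entry(h, t) for t in T_ext] for h in H_ext])
            rows = independent_indices(D, axis=0)
            cols = independent_indices(D, axis=1)
            if len(cols) == v_prev:    # rank stopped growing (possibly too strict)
                return T
            H = [H_ext[i] for i in rows]
            T = [T_ext[j] for j in cols]
            v_prev = len(cols)
        return T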
Predictive State Representation - How to learn w_t in linear PSRs [JS04]
Recall:
$$p(a o t_i | h) = f_{aot_i}(p(T|h)) = p(T|h)^\top w_{aot_i}, \qquad p(a o | h) = f_{ao}(p(T|h)) = p(T|h)^\top w_{ao}$$
Note that $w_t$ does not depend on $h$; hence, stacking the core histories $H$,
$$p(a o t_i | H) = p(T|H) \, w_{aot_i}, \qquad p(a o | H) = p(T|H) \, w_{ao}$$
As $\dim(T) = \dim(H)$ and the rows of $p(T|H)$ are linearly independent,
$$w_{aot_i} = p(T|H)^{-1} p(a o t_i | H), \qquad w_{ao} = p(T|H)^{-1} p(a o | H)$$
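Given the core histories, each weight vector is the solution of an $m \times m$ linear system; a minimal sketch:

    import numpy as np

    def learn_weights(P_TH, p_target_H):
        """Solve p(. | H) = p(T|H) w for the weight vector w.

        P_TH:       matrix p(T|H), shape (m, m); row j is p(T | h_j)
        p_target_H: predictions of the extended test (a o t_i, or a o)
                    at each core history h_j, shape (m,)
        """
        # The rows of p(T|H) are linearly independent, so the solution is
        # unique; np.linalg.solve is preferred over forming the inverse.
        return np.linalg.solve(P_TH, p_target_H)

In practice $p(T|H)$ and the target predictions are estimated from sampled trajectories, so a least-squares solve (np.linalg.lstsq) would be the more robust choice.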
Predictive State Representation - Planning
PSRs can capture the regularities of the environment, represent the environment in a more compact way, and speed up the learning process.
An example from [RRST05]: [Figure: grid-world navigation task with goal state G]
Predictive State Representation - Learning rate [RRST05]
[Figure: steps to goal versus episodes, comparing learners using the environmental state, k-Markov representations (k = 1, ..., 6) and predictive representations (n = 4, 5, 6)]
Predictive State Representation - What's next?
For continuous observations: the Predictive Linear-Gaussian model [RS06] is to the PSR what the LDS is to the HMM.
iOOM/iPSR: can the set of core tests T grow while exploring the environment?
Summary
State space abstraction can provide state space generalisation, can speed up learning and planning in reinforcement learning, and is an alternative to value function approximation.
Predictive state representations are an alternative to POMDPs: they model systems using only observable quantities, with no hidden states, and are compact, but hard to learn.
References
[JS04] Michael R. James and Satinder Singh, Learning and discovery of predictive state representations in dynamical systems with reset, Proceedings of the 21st International Conference on Machine Learning, ACM, 2004.
[LSS02] Michael Littman, Richard Sutton, and Satinder Singh, Predictive representations of state, Advances in Neural Information Processing Systems 14 (NIPS), 2002.
[LWL06] Lihong Li, Thomas J. Walsh, and Michael L. Littman, Towards a unified theory of state abstraction for MDPs, ISAIM, 2006.
[RRST05] Eddie J. Rafols, Mark B. Ring, Richard S. Sutton, and Brian Tanner, Using predictive representations to improve generalization in reinforcement learning, IJCAI, 2005.
[RS06] Matthew Rudary and Satinder Singh, Predictive linear-Gaussian models of controlled stochastic dynamical systems, Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006.
Thanks!