State Space Abstraction for Reinforcement Learning


State Space Abstraction for Reinforcement Learning
Rowan McAllister & Thang Bui
MLG, Cambridge, Nov 6th

Rowan's introduction

Types of abstraction [LWL6]

Abstractions partition the state space based on what they preserve of the state-action value functions Q^π(s, a) and Q*(s, a). They are loosely categorised into five types.

Model-irrelevance abstraction φ_model

Definition
φ_model(s_1) = φ_model(s_2) implies that, for every action a and every abstract state x,
    R^a_{s_1} = R^a_{s_2}   and   Σ_{s' ∈ φ_model^{-1}(x)} P^a_{s_1,s'} = Σ_{s' ∈ φ_model^{-1}(x)} P^a_{s_2,s'}

In words: for any action a, ground states in the same abstract class should have the same reward and the same transition probability into each abstract class.

[Figure: grid-world example over X and Y coordinates, with panels showing the Rewards, Q (action-values), π (policy) and V (state-values).]
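
As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch that checks whether a candidate partition satisfies the model-irrelevance condition above; the tabular layout and the names phi, R and P are assumptions.

```python
import numpy as np

def is_model_irrelevant(phi, R, P, tol=1e-8):
    """Check the model-irrelevance condition for a candidate partition.

    phi : (S,)      phi[s] = abstract class of ground state s
    R   : (S, A)    expected rewards R^a_s
    P   : (S, A, S) transition probabilities P^a_{s,s'}
    """
    n_abstract = phi.max() + 1
    # Transition mass from each (s, a) into each abstract class x.
    P_abs = np.stack([P[:, :, phi == x].sum(axis=2) for x in range(n_abstract)], axis=2)
    for x in range(n_abstract):
        members = np.flatnonzero(phi == x)
        ref = members[0]
        for s in members[1:]:
            if not (np.allclose(R[s], R[ref], atol=tol)
                    and np.allclose(P_abs[s], P_abs[ref], atol=tol)):
                return False
    return True

# Toy usage: two identical ground states collapsed into one abstract state.
R = np.array([[1.0], [1.0]])                     # S = 2 states, A = 1 action
P = np.array([[[0.5, 0.5]], [[0.5, 0.5]]])
print(is_model_irrelevant(np.array([0, 0]), R, P))   # True
```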

Types of abstraction [cont'd]

Definition (Q^π-irrelevance abstraction)
φ_{Q^π}(s_1) = φ_{Q^π}(s_2)  ⇒  Q^π(s_1, a) = Q^π(s_2, a),  ∀π, a

In words: for any policy π and any action a, ground states in the same abstract class should have the same state-action value.

[Figure: the same grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values).]

Types of abstraction [cont'd]

Definition (Q*-irrelevance abstraction)
φ_{Q*}(s_1) = φ_{Q*}(s_2)  ⇒  Q*(s_1, a) = Q*(s_2, a),  ∀a

In words: for any action a, ground states in the same abstract class should have the same optimal state-action value function (and hence the same optimal policy).

[Figure: the same grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values).]

Types of abstraction [cont'd]

Definition (a*-irrelevance abstraction)
φ_{a*}(s_1) = φ_{a*}(s_2)  ⇒  there is an action a* that is optimal for both s_1 and s_2, with
    Q*(s_1, a*) = max_a Q*(s_1, a) = max_a Q*(s_2, a) = Q*(s_2, a*)

In words: ground states in the same abstract class should share an optimal action, and have the same optimal state-action value when taking that optimal action.

[Figure: the same grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values).]

And even more abstract...

Definition (π*-irrelevance abstraction)
φ_{π*}(s_1) = φ_{π*}(s_2)  ⇒  there is an action a* that is optimal for both s_1 and s_2

In words: ground states in the same abstract class should share an optimal action.

[Figure: the same grid-world example, with panels for Rewards, Q (action-values), π (policy) and V (state-values).]
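
The last three definitions depend only on the optimal action-value function. As an illustration (not from the slides), the sketch below checks a candidate partition against the Q*-, a*- and π*-irrelevance conditions from a tabular Q*; the function and array names are hypothetical.

```python
import numpy as np

def check_partition(phi, Q_star, tol=1e-8):
    """Return which of the Q*-based irrelevance conditions a partition satisfies."""
    q_ok, a_ok, pi_ok = True, True, True
    for x in range(phi.max() + 1):
        members = np.flatnonzero(phi == x)
        ref = members[0]
        # Optimal actions of the reference state (allowing ties).
        ref_opt = np.flatnonzero(np.isclose(Q_star[ref], Q_star[ref].max(), atol=tol))
        for s in members[1:]:
            s_opt = np.flatnonzero(np.isclose(Q_star[s], Q_star[s].max(), atol=tol))
            if not np.allclose(Q_star[s], Q_star[ref], atol=tol):
                q_ok = False                              # Q*-irrelevance fails
            if np.intersect1d(ref_opt, s_opt).size == 0:
                a_ok = pi_ok = False                      # no shared optimal action
            elif not np.isclose(Q_star[s].max(), Q_star[ref].max(), atol=tol):
                a_ok = False                              # shared action, different optimal value
    return {"Q*": q_ok, "a*": a_ok, "pi*": pi_ok}

# Toy usage: two states with the same optimal action but different optimal values.
Q = np.array([[1.0, 0.0],
              [2.0, 0.5]])
print(check_partition(np.array([0, 0]), Q))   # {'Q*': False, 'a*': False, 'pi*': True}
```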

Types of abstraction - An example

[Figure: (a) a small example MDP over ground states, and (b) the abstract MDP obtained by aggregating states, with transition probabilities (e.g. 0.5) marked on the arrows.]

Types of abstraction - A comparison

Questions:
- Same optimal policy in both the abstract space and the ground space?
- Same optimal state-action value functions, i.e. does Q-learning with the abstraction remain optimal?

Type       Same optimal policy    Q-learning optimality
φ_model    yes                    yes
φ_{Q^π}    yes                    yes
φ_{Q*}     yes                    yes
φ_{a*}     yes                    no
φ_{π*}     no                     no

[Figure: counterexample MDP (b) with two actions and a 0.5 transition probability, illustrating where the coarser abstractions fail.]

Predictive State Representation: An introduction

A Zoubin cube for HMM/POMDP/PSR...

Predictive State Representation - Overview

A model for controlled dynamical systems.

Why useful?
- A compact representation compared to k-Markov or POMDP models
- Speedy learning

A review of POMDPs

A POMDP is defined by a tuple ⟨S, A, O, T, O, b_0⟩:
- S, A, O: sets of hidden states, actions and observations, with |S| = n
- T: a set of n × n transition matrices T^a, one for each action a ∈ A
- O: a set of n × n diagonal observation matrices O^{a,o}, one for each action-observation pair
- b_0: the initial belief state {p(s_i)}_{i=1}^n

POMDPs use a belief state to encode historical information. The belief state at history h is b(h) = {p(s_i | h)}_{i=1}^n.

After taking action a and observing o, the belief state is updated as
    b(hao) = b(h) T^a O^{a,o} / (b(h) T^a O^{a,o} 1_n),
where 1_n is the n-vector of ones, so the denominator simply renormalises.

PSRs remove the hidden states and instead use probabilities of particular future action-observation sequences to encode the same historical information.
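
For concreteness (not from the slides), a minimal NumPy sketch of this belief update; the 2-state matrices below are hypothetical.

```python
import numpy as np

def belief_update(b, T_a, O_ao):
    """POMDP belief update: b' is proportional to b T^a O^{a,o} (row-vector convention)."""
    unnormalised = b @ T_a @ O_ao
    return unnormalised / unnormalised.sum()

# Hypothetical 2-state POMDP.
b0 = np.array([0.5, 0.5])              # initial belief b_0
T_a = np.array([[0.9, 0.1],            # transition matrix T^a
                [0.2, 0.8]])
O_ao = np.diag([0.7, 0.1])             # diagonal observation matrix O^{a,o}
print(belief_update(b0, T_a, O_ao))    # belief over the 2 hidden states after (a, o)
```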

Predictive State Representation - Notations

A test (or experiment) t is a sequence of actions and observations:
    t = a_1 o_1 a_2 o_2 ... a_k o_k

An observed history h of length n:
    h = a_1 o_1 a_2 o_2 ... a_n o_n

Test prediction given a history:
    p(t | h) = Pr(o_{n+1} = o_1, ..., o_{n+k} = o_k | h, a_{n+1} = a_1, ..., a_{n+k} = a_k)

A set of m core tests T:
    T = {t_1, t_2, ..., t_m},   p(T | h) = [p(t_1 | h), p(t_2 | h), ..., p(t_m | h)]
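
Since p(t | h) is just the probability that the test's observations occur given that its actions are executed after history h, it can be estimated by simple counting when the system can be reset and re-run. A small illustrative sketch (the episode format and all names are assumptions, not from the slides):

```python
from typing import List, Tuple

Step = Tuple[str, str]      # an (action, observation) pair
Seq = Tuple[Step, ...]      # a test or a history is a sequence of steps

def matches(episode: Seq, h: Seq, t: Seq) -> Tuple[bool, bool]:
    """Did the episode start with h and then execute t's actions?  Did t's observations occur?"""
    ht = h + t
    if len(episode) < len(ht) or episode[:len(h)] != h:
        return False, False
    segment = episode[len(h):len(ht)]
    attempted = all(a == ta for (a, _), (ta, _) in zip(segment, t))
    succeeded = attempted and all(o == to for (_, o), (_, to) in zip(segment, t))
    return attempted, succeeded

def estimate_prediction(episodes: List[Seq], h: Seq, t: Seq) -> float:
    """Monte-Carlo estimate of p(t | h) from recorded episodes."""
    flags = [matches(e, h, t) for e in episodes]
    attempted = sum(a for a, _ in flags)
    succeeded = sum(s for _, s in flags)
    return succeeded / attempted if attempted else float("nan")

# Two hypothetical episodes; the test a2 o2 succeeds after history a1 o1 in one of them.
eps = [(("a1", "o1"), ("a2", "o2")),
       (("a1", "o1"), ("a2", "o3"))]
print(estimate_prediction(eps, (("a1", "o1"),), (("a2", "o2"),)))   # 0.5
```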

Predictive State Representation - Definition

Definition ([LSS2])
A set T is a PSR iff its prediction vector forms a sufficient statistic for the dynamical system, i.e.
    p(t | h) = f_t(p(T | h))
for any test t and history h.

The mapping f_t(.) can be linear or non-linear. A linear f_t(.) means that for every test t there exists a weight vector w_t such that p(t | h) = p(T | h)^T w_t.

Updating p(T | h) upon taking action a and observing o:
    p(t_i | hao) = p(a o t_i | h) / p(a o | h) = f_{aot_i}(p(T | h)) / f_{ao}(p(T | h)) = p(T | h)^T w_{aot_i} / p(T | h)^T w_{ao}
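
A minimal sketch (not from the slides) of the linear PSR state update, assuming the weight vectors w_{aot_i} and w_{ao} are already known; all array names and values are hypothetical.

```python
import numpy as np

def psr_update(p_T, W_ao_ext, w_ao):
    """Linear PSR update of the prediction vector after taking action a and observing o.

    p_T      : (m,)   current prediction vector p(T | h)
    W_ao_ext : (m, m) column i holds w_{a o t_i}, the weights of the extended core tests
    w_ao     : (m,)   weights such that p(a o | h) = p(T | h) . w_ao
    """
    numerator = p_T @ W_ao_ext      # p(a o t_i | h) for each core test t_i
    denominator = p_T @ w_ao        # p(a o | h)
    return numerator / denominator  # the new prediction vector p(T | hao)

# Hypothetical 2-core-test PSR.
p_T = np.array([0.6, 0.3])
W_ao_ext = np.array([[0.5, 0.2],
                     [0.1, 0.4]])
w_ao = np.array([0.7, 0.5])
print(psr_update(p_T, W_ao_ext, w_ao))
```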

Predictive State Representation - Why use PSR?

Theorem ([LSS2])
For any environment that can be represented by a finite POMDP model, there exists a linear PSR with a number of core tests m no larger than the number of states in the minimal POMDP model.

Example (the float-reset domain, with actions f and r):
- POMDP: 5 states
- Linear PSR: T = {r, fr, ffr, fffr, ffffr}
- Nonlinear PSR: T = {r, fr}

Predictive State Representation - A system dynamics matrix

The system dynamics matrix D specifies the prediction of any test given any history:

    D      t_1          ...   t_i          ...
    h_1    p(t_1 | h_1)       p(t_i | h_1)
    h_2    p(t_1 | h_2)       p(t_i | h_2)
    ...
    h_j    p(t_1 | h_j)       p(t_i | h_j)
    ...

D has an infinite number of columns and an infinite number of rows!

Let T be the set of linearly independent columns of D. Then T is a linear PSR, since D(:, t) = D(:, T) w_t, i.e. p(t | h) = p(T | h)^T w_t.

Similarly, let H be the set of linearly independent rows of D; H forms a set of core histories.
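
The core tests are exactly a maximal set of linearly independent columns. On any finite sub-block of D this can be found with a greedy rank test; a small illustrative NumPy sketch (not from the slides):

```python
import numpy as np

def independent_columns(D, tol=1e-10):
    """Indices of a maximal set of linearly independent columns of D (greedy rank test)."""
    chosen = []
    for j in range(D.shape[1]):
        if np.linalg.matrix_rank(D[:, chosen + [j]], tol=tol) == len(chosen) + 1:
            chosen.append(j)
    return chosen

# Toy sub-block of a system dynamics matrix: the third column is the sum of the first two.
D = np.array([[0.2, 0.5, 0.7],
              [0.4, 0.1, 0.5],
              [0.3, 0.3, 0.6]])
print(independent_columns(D))   # [0, 1]
```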

Predictive State Representation - How to discover T [JS4]

Data: the system dynamics matrix D
Result: T = {t_1, t_2, ..., t_m}
Initialisation: start from the null test and null history, T = {ε}, H = {ε}, and the initial rank v_0
while true do
    form the one-step extension of D_{H,T}, where T+ and H+ are the one-step extensions of the current tests and histories:
        D'_{H,T} = [ D_{H,T}    D_{H,T+}
                     D_{H+,T}   D_{H+,T+} ]
    T <- linearly independent columns of D'_{H,T}
    H <- linearly independent rows of D'_{H,T}
    v_i <- rank of D'_{H,T}
    if stopping condition v_i == v_{i-1} then stop
end

Caveats:
- D cannot be computed exactly in general, hence sampling is needed.
- T may not be complete, as the stopping condition v_i == v_{i-1} is too strict.
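
A compact sketch of this search loop (illustrative only): it assumes a callable predict(h, t) returning an estimate of p(t | h), plus a small helper for linearly independent rows; none of these names come from the slides.

```python
import numpy as np
from itertools import product

def independent_rows(M, tol=1e-8):
    """Greedily select indices of a maximal set of linearly independent rows of M."""
    chosen = []
    for i in range(M.shape[0]):
        if np.linalg.matrix_rank(M[chosen + [i], :], tol=tol) == len(chosen) + 1:
            chosen.append(i)
    return chosen

def discover_core_tests(predict, actions, observations, max_iter=10, tol=1e-8):
    """Grow tests and histories until the rank of the sub-matrix stops increasing."""
    extend = lambda seqs: [s + ((a, o),) for s in seqs
                           for a, o in product(actions, observations)]
    tests, hists = [()], [()]          # start from the null test and the null history
    prev_rank = 0
    for _ in range(max_iter):
        cand_t = tests + extend(tests)                 # current tests and one-step extensions (T, T+)
        cand_h = hists + extend(hists)                 # current histories and extensions (H, H+)
        D = np.array([[predict(h, t) for t in cand_t] for h in cand_h])
        rank = np.linalg.matrix_rank(D, tol=tol)
        tests = [cand_t[j] for j in independent_rows(D.T, tol)]   # independent columns of D
        hists = [cand_h[i] for i in independent_rows(D, tol)]     # independent rows of D
        if rank == prev_rank:                          # stopping condition v_i == v_{i-1}
            break
        prev_rank = rank
    return tests
```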

Predictive State Representation - How to learn w_t in linear PSRs [JS4]

Recall:
    p(a o t_i | h) = f_{aot_i}(p(T | h)) = p(T | h)^T w_{aot_i}
    p(a o | h)     = f_{ao}(p(T | h))    = p(T | h)^T w_{ao}

Note that w_t does not depend on h. Stacking these equations over the core histories H gives
    p(a o t_i | H) = p(T | H) w_{aot_i}
    p(a o | H)     = p(T | H) w_{ao}

Since dim(T) = dim(H) and the rows of p(T | H) are linearly independent, p(T | H) is invertible, so
    w_{aot_i} = p(T | H)^{-1} p(a o t_i | H)
    w_{ao}    = p(T | H)^{-1} p(a o | H)
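
A minimal sketch of this estimation step (illustrative; in practice the arrays would hold sampled estimates of the corresponding predictions, in which case a least-squares solve is the safer choice).

```python
import numpy as np

def learn_weights(P_T_H, p_ext_H):
    """Solve p(ext | H) = p(T | H) w for the weight vector w of one extension test."""
    # With exact predictions p(T | H) is square and invertible; with noisy estimates
    # np.linalg.lstsq would be preferable.
    return np.linalg.solve(P_T_H, p_ext_H)

# Hypothetical 2-core-test example: rows are core histories, columns are core tests.
P_T_H = np.array([[0.6, 0.3],
                  [0.2, 0.7]])
p_aot_H = np.array([0.4, 0.5])          # p(a o t_i | h_j) for one extension test
print(learn_weights(P_T_H, p_aot_H))    # w_{a o t_i}
```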

Predictive State Representation - Planning

PSRs can capture the regularities of the environment, represent the environment in a more compact way, and speed up the learning process.

An example from [RRST5]: [Figure: a grid-world navigation task with a goal cell G.]

Predictive State Representation - Learning rate [RRST5]

[Figure: learning curves (steps per episode vs. episodes) comparing k-order Markov models (k = 1, ..., 6), predictive representations (n = 4, 5, 6) and an agent given the environmental state.]

Predictive State Representation - What's next?

For continuous observations: just as the HMM extends to the linear dynamical system (LDS), the PSR extends to the Predictive Linear-Gaussian model [RS6].
iOOM/iPSR: can the set of core tests T grow while exploring the environment?

Summary

State space abstraction
- can provide state space generalisation,
- can speed up learning/planning in reinforcement learning,
- is an alternative to value function approximation.

Predictive state representations, as an alternative to POMDPs,
- model systems using only observable quantities and no hidden states,
- are compact, but hard to learn.

References

[JS4] Michael R. James and Satinder Singh. Learning and discovery of predictive state representations in dynamical systems with reset. In Proceedings of the 21st International Conference on Machine Learning. ACM, 2004.

[LSS2] Michael Littman, Richard Sutton, and Satinder Singh. Predictive representations of state. In Advances in Neural Information Processing Systems 14 (NIPS), 2002.

[LWL6] Lihong Li, Thomas J. Walsh, and Michael L. Littman. Towards a unified theory of state abstraction for MDPs. In ISAIM, 2006.

[RRST5] Eddie J. Rafols, Mark B. Ring, Richard S. Sutton, and Brian Tanner. Using predictive representations to improve generalization in reinforcement learning. In IJCAI, 2005.

[RS6] Matthew Rudary and Satinder Singh. Predictive linear-Gaussian models of controlled stochastic dynamical systems. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.

Thanks!
