1 / 58 An Introduction to Reinforcement Learning Lecture 01: Introduction Dr. Johannes A. Stork School of Computer Science and Communication KTH Royal Institute of Technology January 19, 2017
[Figures: reward examples (../fig/reward-00.jpg through ../fig/reward-04.jpg), slide 3]
[Figure: marshmallow experiment (../fig/marshmellow-01.jpg), slide 4]
Play video (slide 5)
6 / 58 Marshmallow Experiment and Delayed Gratification. SAT scores: Mischel, Walter; Shoda, Yuichi; Rodriguez, Monica L. (1989). "Delay of gratification in children". Science 244: 933–938. Educational attainment: Ayduk, Ozlem N.; Mendoza-Denton, Rodolfo; Mischel, Walter; Downey, Geraldine; Peake, Philip K.; Rodriguez, Monica L. (2000). "Regulating the interpersonal self: Strategic self-regulation for coping with rejection sensitivity". Journal of Personality and Social Psychology 79 (5): 776–792. Body mass index: Schlam, Tanya R.; Wilson, Nicole L.; Shoda, Yuichi; Mischel, Walter; Ayduk, Ozlem (2013). "Preschoolers' delay of gratification predicts their body mass 30 years later". The Journal of Pediatrics 162: 90–93. Other life measures: Shoda, Yuichi; Mischel, Walter; Peake, Philip K. (1990). "Predicting Adolescent Cognitive and Self-Regulatory Competencies from Preschool Delay of Gratification: Identifying Diagnostic Conditions". Developmental Psychology 26 (6): 978–986.
7 / 58 Table of Contents: Today's Goal; Reinforcement Learning Problems & Agents (Intuition, Terminology, History and Structure of the Field); Formalization; Informal Examples (Dynamic Programming, Monte Carlo); Summary; Material
8 / 58 Today's Goal: Intuition (RL problems, RL agents), Terminology, History, Formalization, Examples
9 / 58 Table of Contents: Today's Goal; Reinforcement Learning Problems & Agents (Intuition, Terminology, History and Structure of the Field); Formalization; Informal Examples (Dynamic Programming, Monte Carlo); Summary; Material
11 / 58 What is Reinforcement Learning? Goal-directed learning (we want something)... from interaction with the environment (actions change it; we collect new experience). Learn how to map situations to actions (encode & improve behavior)... in order to maximize some reward (the goal is given implicitly by some external signal).
12 / 58 Characteristics of RL Problems 1. Actions influence later inputs (i.e. closed-loop) 2. No direct instructions (only reward signal) 3. Consequences of actions play out over (long) time (horizon)
13 / 58 Winning in Car Racing (../fig/f1.jpg) 1. Switch gear, brake, steer 2. Win! (Don't crash! Stay on the track! Respect the Safety Car!...) 3. Pit stop timing, choice of tires
14 / 58 Agents in Reinforcement Learning 1. Sense state of environment 2. Take action that affects state 3. Have a goal relating to the state of the environment../fig/agent.png An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors. Russell and Norvig [RN03, pp. 32, 33]
15 / 58 Comparing to Supervised and Unsupervised Learning (../fig/slvsul.jpg). Supervised Learning (e.g. classification, regression, ranking): labeled examples from a supervisor, e.g. {(situation, action)_i}_i; learning task: generalize & extrapolate; not learning from interaction (in RL we cannot sample such labeled examples). Unsupervised Learning (e.g. clustering, segmentation, dimensionality reduction): unlabeled examples; learning task: find hidden structure; RL instead tries to maximize reward.
16 / 58 Exploration and Exploitation (../fig/cow_small.jpg). Maximize reward: exploit knowledge about rewarding actions. Discover reward-maximizing actions: explore new actions. Need for a tradeoff (not present in Supervised or Unsupervised Learning).
17 / 58 Exploration and Exploitation in Car Racing (../fig/f1.jpg, ../fig/f1_crash.jpg). Exploitation: ? Exploration: ? Both exploration and exploitation can fail, but we can learn from failing too.
18 / 58 Challenges in Reinforcement Learning (../fig/f1.jpg, ../fig/robotchess.jpg, ../fig/robotsoccer.jpg). Interaction with the environment. Uncertainty of the situation (perception & state estimation). Delayed consequences (requires foresight & planning). Effects of actions cannot be fully predicted (stochastic environment). Measuring the goal with an (immediate) reward signal. Success stories: Neurogammon and TD-Gammon reached world-champion level [TS89; Tes95]; neural controllers learn within a small number of trials [Rie05; MLR12]; cooperative RL agents [Rie+00; LR04].
19 / 58 Elements of Reinforcement Learning. Policy: defines behavior at a given time; similar to a stimulus-response rule. Reward signal: defines the goal / what is to be achieved (indirectly); the agent wants to maximize total reward over its running time; immediate feedback for a situation. Value function: what is good in the long run, e.g. total reward until the end of the interaction; a more foresighted evaluation of a situation¹; main problem: efficiently estimating values. Model: mimics the behavior of the environment; allows inference; model-based vs. model-free. ¹ Reward and value don't have to agree.
20 / 58 Examples (../fig/f1.jpg, ../fig/robotchess.jpg, ../fig/robotsoccer.jpg) What are a good policy, reward signal, value function, and model?
21 / 58 Optimization and Optimality Agent tries to maximize reward (i.e. optimization) Optimality might be impossible (theoretically or practically)
22 / 58 Table of Contents: Today's Goal; Reinforcement Learning Problems & Agents (Intuition, Terminology, History and Structure of the Field); Formalization; Informal Examples (Dynamic Programming, Monte Carlo); Summary; Material
23 / 58 Three Different Threads 1. Learning by trial and error 2. Optimal control 3. Temporal-difference methods. Joined in the 1980s to form modern Reinforcement Learning.
24 / 58 Learning by Trial and Error. Concepts from animal behavior (since the 1850s). Law of effect: the greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond between situation and response (Thorndike). Reinforcement: strengthening of a pattern of behavior as a result of a stimulus (Pavlov, 1927). In computational intelligence: pleasure-pain system (Turing, 1948); machines that record connections between configurations based on feedback; electro-mechanical machines that find paths through mazes.
25 / 58 Optimal Control. Designing a controller to minimize a measure of a dynamical system's behavior over time (since the 1950s); uses the Bellman equation and dynamic programming. Formalization: Markovian decision processes. Dynamic programming: efficient tabular algorithms; regarded as the only feasible way to solve general stochastic optimal control problems. Curse of dimensionality: computation grows exponentially in the number of state variables.
26 / 58 Temporal-difference Methods. Driven by the difference between temporally successive estimates of the same quantity, e.g. the probability of winning a game, or the value of a state. Unique to Reinforcement Learning: TD learning, Q-learning, SARSA; eligibility traces: TD(λ), Q(λ), SARSA(λ).
27 / 58 Structure of the Field of Reinforcement Learning¹ [Diagram: small vs. large problems; tabular vs. approximation methods; finite MDPs, bandit problems; value functions vs. policies; Bellman equations, policy gradient; Dynamic Programming, Monte Carlo, temporal-difference; on-policy vs. off-policy; eligibility traces. (Only half the truth.)] ¹ As in S&B, 2017
28 / 58 Classification of Problems and Environments

Property                          | Tabular methods         | Approximative
                                  | DP     MC     TD        | methods ...
Stationary [yes/no]               |                         |
State space [cont./dis.]          | dis.   dis.   dis.      |
Action space [cont./dis.]         | dis.   dis.   dis.      |
State aliasing [yes/no]           | no     no     no        |
Observable [yes/no]               | yes    yes    yes       |
Feedback [instr./eval.]           | eval.  eval.  eval.     | eval.
Associative [yes/no]              | yes    yes    yes       | yes
Return [epis./cont./dis.]         |                         |
Value at [action, state, after]   |                         |

(More on p. 405 in S&B, 2017)
29 / 58 Classification of Methods

Property                       | Tabular methods              | Approximative
                               | DP       MC        TD        | methods ...
Solution [approx./exact]       | exact                        |
Incremental                    | +        -         +         |
Analysis                       | +        +         -         |
Model-based                    | yes      no        no        |
Learning [off/on policy]       |                              |
Action selection/exploration   |          On-policy Off-policy|
Bootstrapping                  |                              |
Uses actual experience         |                              |
Backup type                    | all      actual              |
Value function                 | exact    exact     exact     | approx.
Policy                         | exact    exact     exact     | approx. or exact

(More on p. 405 in S&B, 2017)
30 / 58 Backup Diagrams (figure from S&B, 2017)
31 / 58 Table of Contents: Today's Goal; Reinforcement Learning Problems & Agents (Intuition, Terminology, History and Structure of the Field); Formalization; Informal Examples (Dynamic Programming, Monte Carlo); Summary; Material
32 / 58 Path Finding in a Grid-based Maze (figure: agent A, goal G). Four deterministic motions (up, down, left, right). Free space, walls, and position observable. Goal position (G).
33 / 58 Interaction Between Agent and Environment (S&B, 2017, p. 48). Time steps t = 0, 1, 2, ...: sense state S_t ∈ S; execute action A_t ∈ A(S_t); receive reward R_{t+1} ∈ R ⊂ ℝ; transition to state S_{t+1}.
34 / 58 Goals and Reward: Reward Hypothesis. "[...] goals and purposes [...] can be [...] thought of as the maximization of the expected [...] sum of a received scalar signal" (S&B 2017, p. 51). Rewards come from outside the agent. The reward must model the task.
35 / 58 Collecting Reward Over Time: Returns. "sum of a received scalar signal". Reward signals: $R_{t+1}, R_{t+2}, \ldots \in \mathcal{R}$. Episodic case ($T \in \mathbb{N}$): return $G_t = \sum_{k=0}^{T} R_{t+1+k}$. Continuing case ($T = \infty$): discounted return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+1+k}$. Objective: maximize $\mathbb{E}[G_t]$.
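The two return definitions above can be sketched in a few lines of Python; the reward sequence is an illustrative assumption (reward 1 only on reaching the goal, as in the maze example later in the lecture):

```python
def episodic_return(rewards):
    """G_t: plain sum of the rewards R_{t+1}, ..., R_{T+1} of one episode."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+1+k}, the discounted return."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 1]                     # reward only at the final step
print(episodic_return(rewards))            # 1
print(discounted_return(rewards, 0.9))     # 0.9**3 = 0.729
```

Discounting makes the same terminal reward worth less the further it lies in the future, which is what lets the infinite sum converge in the continuing case.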
36 / 58 Environment and States. "whatever information is available to the agent". State representation S. Markov property: $p(S_{t+1}, R_{t+1} \mid S_0, A_0, R_1, \ldots, R_t, S_t, A_t) = p(S_{t+1}, R_{t+1} \mid S_t, A_t)$. The state signal retains all relevant information; the one-step dynamics suffice to predict $S_{t+1}$ and $R_{t+1}$.
37 / 58 Formal Framework: Markovian Decision Process. Environment: S (state set), A (action set), $p(s', r \mid s, a)$ (one-step dynamics). Agent: $\pi_t(a \mid s) = p(A_t = a \mid S_t = s)$, a probabilistic policy. Learning: change $\pi_t$ to collect more reward over time. (S&B, 2017, p. 48)
38 / 58 Modelling the Task as an MDP. S = free cells; A = {up, down, left, right}; $p(s', r \mid s, a)$: deterministic motion; $R_{t+1} = 1$ for the goal state, $R_{t+1} = 0$ for all other states; $\pi_t$ deterministic.
39 / 58 Characterizing States and Actions by Value. State-value function for policy π: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \mid S_t = s]$. Action-value function for policy π: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \mid S_t = s, A_t = a]$. $\mathbb{E}_\pi[\cdot]$ denotes expectation when following π.
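Operationally, $v_\pi(s)$ is the average return obtained when starting in s and following π, so it can be estimated by averaging sampled returns. The sketch below does this on a tiny four-cell corridor with a uniform random policy; both the environment and the policy are illustrative assumptions, not the lecture's maze:

```python
import random

def sample_return(start, gamma=0.9):
    """Roll out one episode from `start` under a uniform random policy.
    States 0..3; state 3 is the goal (reward 1 there, 0 elsewhere)."""
    s, g, discount = start, 0.0, 1.0
    while s != 3:
        s = max(0, s + random.choice([-1, 1]))   # step left or right, wall at 0
        g += discount * (1.0 if s == 3 else 0.0)
        discount *= gamma
    return g

random.seed(0)
returns = [sample_return(1) for _ in range(5000)]
v_hat = sum(returns) / len(returns)   # Monte Carlo estimate of v_pi(1)
print(v_hat)
```

This is exactly the Monte Carlo idea that reappears later in the lecture: no model of the dynamics is used, only sampled episodes.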
40 / 58 Learning with Value Functions Exact computation: Dynamic Programming Estimation: Monte Carlo Approximation: v π and q π parameterized
41 / 58 Recursive Relationship and Bellman Equation. Consistency between $v_\pi(s)$ and $v_\pi(s')$ for all $s \in S$ with successor $s' \in S$:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
$= \mathbb{E}_\pi[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \mid S_t = s]$
$= \mathbb{E}_\pi[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+2+k} \mid S_t = s]$
$= \sum_{a,r,s'} \pi(a \mid s)\, p(s', r \mid s, a)\, (r + \gamma \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'])$
$= \sum_{a,r,s'} \pi(a \mid s)\, p(s', r \mid s, a)\, (r + \gamma v_\pi(s'))$
42 / 58 Information Backup. Bellman equation for $v_\pi$: for all $s \in S$, $v_\pi(s) = \sum_{a,r,s'} \pi(a \mid s)\, p(s', r \mid s, a)\, (r + \gamma v_\pi(s'))$. Reinforcement Learning is (mostly) based on backup operations. Backup diagrams for $v_\pi$ and $q_\pi$ (S&B, 2017, p. 64).
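A single backup operation for $v_\pi$ can be sketched directly from the equation above. The two-state dynamics and the policy below are illustrative assumptions, not the lecture's maze:

```python
def bellman_backup(s, pi, p, v, gamma):
    """One Bellman backup: sum over a, s', r of
    pi(a|s) * p(s',r|s,a) * (r + gamma * v(s')).
    pi[s] maps actions to probabilities; p[(s, a)] lists (s', r, prob)."""
    total = 0.0
    for a, pa in pi[s].items():
        for s_next, r, prob in p[(s, a)]:
            total += pa * prob * (r + gamma * v[s_next])
    return total

pi = {0: {"stay": 0.5, "go": 0.5}}
p = {(0, "stay"): [(0, 0.0, 1.0)],   # staying yields no reward
     (0, "go"):   [(1, 1.0, 1.0)]}   # moving to state 1 yields reward 1
v = {0: 0.0, 1: 0.0}
print(bellman_backup(0, pi, p, v, gamma=0.9))   # 0.5*0.0 + 0.5*1.0 = 0.5
```

Repeatedly sweeping such backups over all states is the core of the dynamic programming methods shown later.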
43 / 58 Optimal Value Functions. Partial order on policies: $\pi \geq \pi'$ iff $v_\pi(s) \geq v_{\pi'}(s)$ for all $s \in S$. Optimal policies $\pi_*$. Optimal state-value function: $v_*(s) = \max_\pi v_\pi(s)$ for all $s \in S$. Optimal action-value function: $q_*(s, a) = \max_\pi q_\pi(s, a)$ for all $s \in S$, $a \in A(s)$.
44 / 58 Bellman Optimality Equation. For $v_*$: $v_*(s) = \max_{a \in A(s)} q_{\pi_*}(s, a) = \max_{a \in A(s)} \sum_{s', r} p(s', r \mid s, a)\, (r + \gamma v_*(s'))$. For $q_*$: $q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\, (r + \gamma \max_{a' \in A(s')} q_*(s', a'))$. Backup diagrams for $v_*$ and $q_*$ (S&B, 2017, p. 69).
45 / 58 Learning with Optimal Value Functions. System of equations, one equation per state; can be solved explicitly. A greedy policy is then optimal: $\pi_*(s) = \arg\max_{a \in A(s)} \sum_{s', r} p(s', r \mid s, a)\, (r + \gamma v_*(s'))$ or $\pi_*(s) = \arg\max_{a \in A(s)} q_*(s, a)$. Only a one-step-ahead search is needed; based on $v_*$ it requires the dynamics, based on $q_*$ it requires no knowledge of the dynamics.
46 / 58 Table of Contents: Today's Goal; Reinforcement Learning Problems & Agents (Intuition, Terminology, History and Structure of the Field); Formalization; Informal Examples (Dynamic Programming, Monte Carlo); Summary; Material
47 / 58 Modelling the Task as an MDP. Simplification: deterministic. S = free cells; A = {up, down, left, right}; $p(s', r \mid s, a)$: deterministic motion; $r = 1$ for reaching the goal state, $r = -1$ for all other transitions; $\pi_t$ deterministic.
48 / 58 Dynamic Programming. Model-based: $p(s' \mid s, a)$ and $r(s, a, s')$ known. Iteratively approximate $v_*$; get $\pi_*$ from $v_*$. Table representation for v; one-step backup for each state. Value Iteration algorithm¹:
for k = 1, 2, ... do
    foreach s ∈ S do
        $V_k(s) \leftarrow \max_{a \in A} \{ r(s, a, s') + \gamma V_{k-1}(s') \}$
    end
end
¹ For proof, see a later session.
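The Value Iteration loop above can be sketched on a toy deterministic 1×4 corridor, a stand-in for the maze (the corridor itself is an illustrative assumption; reward 1 on entering the goal and −1 otherwise follows the simplified MDP of the previous slide):

```python
GOAL, GAMMA = 3, 0.9
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]

def step(s, a):
    """Deterministic transition; walls clamp movement. Returns (s', r)."""
    s_next = max(0, s - 1) if a == "left" else min(3, s + 1)
    return s_next, (1.0 if s_next == GOAL else -1.0)

V = {s: 0.0 for s in STATES}
for k in range(100):
    V_new = {}
    for s in STATES:
        if s == GOAL:
            V_new[s] = 0.0               # terminal state keeps value 0
            continue
        # One-step backup: V_k(s) = max_a { r(s,a,s') + gamma * V_{k-1}(s') }
        V_new[s] = max(r + GAMMA * V[s_next]
                       for s_next, r in (step(s, a) for a in ACTIONS))
    delta = max(abs(V_new[s] - V[s]) for s in STATES)
    V = V_new
    if delta < 1e-10:                    # values have converged
        break

print(V)   # V[2] = 1.0, V[1] = -0.1, V[0] = -1.09 (approximately)
```

Because the model (`step`) is known, each sweep backs up every state; the greedy policy with respect to the converged V heads right toward the goal.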
49 / 58 Map (figure from S&B, 2017)
Value Backup 50 / 58
Reward Signal Design 51 / 58
52 / 58 Monte Carlo. Model-free: $p(s' \mid s, a)$ and $r(s, a, s')$ not known. Approximate $q_*$ from actual episodes, e.g. transitions (s, a, r, s'). Table representation for q; one backup per observed transition. Q-Learning algorithm¹:
repeat until convergence do
    execute some episode;
    $Q(s, a) \leftarrow r + \gamma \max_{a' \in A(s')} Q(s', a')$;
end
Run several episodes; balance exploration and exploitation. ¹ For proof, see a later session.
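The tabular update above can be sketched on the same illustrative 1×4 corridor used for Value Iteration (goal at cell 3, reward 1 on reaching it, −1 otherwise); the ε-greedy action selection and the episode count are additional assumptions for the sketch:

```python
import random

GOAL, GAMMA, EPS = 3, 0.9, 0.2
ACTIONS = ["left", "right"]

def step(s, a):
    """Deterministic corridor transition; walls clamp movement."""
    s_next = max(0, s - 1) if a == "left" else min(3, s + 1)
    return s_next, (1.0 if s_next == GOAL else -1.0)

random.seed(0)
Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
for episode in range(500):
    s = random.randrange(3)                  # random non-goal start cell
    while s != GOAL:
        if random.random() < EPS:            # explore ...
            a = random.choice(ACTIONS)
        else:                                # ... or exploit current Q
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r = step(s, a)
        # Q(s, a) <- r + gamma * max_a' Q(s', a')   (slide's update rule)
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
        s = s_next

# The greedy policy argmax_a Q(s, a) then heads right toward the goal.
```

Unlike the Dynamic Programming sketch, no model is consulted: the agent only uses the (s, a, r, s') transitions it actually experiences, and ε-greedy selection supplies the exploration that tabular Q-learning needs to visit all state-action pairs.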
53 / 58 Map (figure from S&B, 2017)
54 / 58 Table of Contents: Today's Goal; Reinforcement Learning Problems & Agents (Intuition, Terminology, History and Structure of the Field); Formalization; Informal Examples (Dynamic Programming, Monte Carlo); Summary; Material
55 / 58 Summary. Reinforcement learning: a type of learning problem, not a method. Exploration and exploitation. Important characteristics: interaction; the reward describes the goal. Challenges: delayed consequences, credit assignment. Formalization: MDPs, value functions, Bellman (optimality) equations.
56 / 58 Table of Contents: Today's Goal; Reinforcement Learning Problems & Agents (Intuition, Terminology, History and Structure of the Field); Formalization; Informal Examples (Dynamic Programming, Monte Carlo); Summary; Material
57 / 58 Material I. Martin Lauer and Martin Riedmiller. "Reinforcement learning for stochastic cooperative multi-agent systems". In: Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, Volume 3. IEEE Computer Society, 2004, pp. 1516–1517. Jan Mattner, Sascha Lange, and Martin Riedmiller. "Learn to swing up and balance a real pole based on raw visual input data". In: International Conference on Neural Information Processing. Springer, 2012, pp. 126–133. Martin Riedmiller et al. "Karlsruhe Brainstormers: a reinforcement learning approach to robotic soccer". In: Robot Soccer World Cup. Springer, 2000, pp. 367–372.
58 / 58 Material II. Martin Riedmiller. "Neural reinforcement learning to swing-up and balance a real pole". In: 2005 IEEE International Conference on Systems, Man and Cybernetics. Vol. 4. IEEE, 2005, pp. 3191–3196. Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 2nd ed. Prentice Hall, 2003. Gerald Tesauro. "Temporal Difference Learning and TD-Gammon". In: Commun. ACM 38.3 (Mar. 1995), pp. 58–68. ISSN: 0001-0782. DOI: 10.1145/203330.203343. URL: http://doi.acm.org/10.1145/203330.203343. Gerald Tesauro and Terrence J. Sejnowski. "A parallel network that learns to play backgammon". In: Artificial Intelligence 39.3 (1989), pp. 357–390.