An Introduction to Reinforcement Learning


1 / 58 An Introduction to Reinforcement Learning, Lecture 01: Introduction. Dr. Johannes A. Stork, School of Computer Science and Communication, KTH Royal Institute of Technology. January 19, 2017

3 / 58 [Figures: reward examples]

4 / 58 [Figure: marshmallow experiment]

5 / 58 Play video

6 / 58 Marshmallow Experiment and Delayed Gratification
SAT scores: Mischel, Walter; Shoda, Yuichi; Rodriguez, Monica L. (1989). "Delay of gratification in children." Science 244: 933–938.
Educational attainment: Ayduk, Ozlem N.; Mendoza-Denton, Rodolfo; Mischel, Walter; Downey, Geraldine; Peake, Philip K.; Rodriguez, Monica L. (2000). "Regulating the interpersonal self: Strategic self-regulation for coping with rejection sensitivity." Journal of Personality and Social Psychology 79 (5): 776–792.
Body mass index: Schlam, Tanya R.; Wilson, Nicole L.; Shoda, Yuichi; Mischel, Walter; Ayduk, Ozlem (2013). "Preschoolers' delay of gratification predicts their body mass 30 years later." The Journal of Pediatrics 162: 90–93.
Other life measures: Shoda, Yuichi; Mischel, Walter; Peake, Philip K. (1990). "Predicting Adolescent Cognitive and Self-Regulatory Competencies from Preschool Delay of Gratification: Identifying Diagnostic Conditions." Developmental Psychology 26 (6): 978–986.

7 / 58 Table of Contents: Today's Goal, Reinforcement Learning Problems & Agents, Intuition, Terminology, History and Structure of the Field, Formalization, Informal Examples, Dynamic Programming, Monte Carlo, Summary, Material

8 / 58 Today's Goal: Intuition, RL problems, RL agents, Terminology, History, Formalization, Examples

9 / 58 Table of Contents: Today's Goal, Reinforcement Learning Problems & Agents, Intuition, Terminology, History and Structure of the Field, Formalization, Informal Examples, Dynamic Programming, Monte Carlo, Summary, Material

10 / 58 Table of Contents: Today's Goal, Reinforcement Learning Problems & Agents, Intuition, Terminology, History and Structure of the Field, Formalization, Informal Examples, Dynamic Programming, Monte Carlo, Summary, Material

11 / 58 What is Reinforcement Learning?
Goal-directed learning... from interaction with the environment.
Learn how to map situations to actions... in order to maximize some reward.

11 / 58 What is Reinforcement Learning?
Goal-directed learning (want something)... from interaction with the environment (actions change the environment; collect new experience).
Learn how to map situations to actions (encode & improve behavior)... in order to maximize some reward (the goal is implicitly given by some external signal).

12 / 58 Characteristics of RL Problems
1. Actions influence later inputs (i.e. closed-loop)
2. No direct instructions (only a reward signal)
3. Consequences of actions play out over (long) time horizons

13 / 58 Winning in Car Racing [Figure: F1 car]
1. Actions influence later inputs (i.e. closed-loop)
2. No direct instructions (only a reward signal)
3. Consequences of actions play out over (long) time horizons

13 / 58 Winning in Car Racing [Figure: F1 car]
1. Switch gear, brake, steer
2. Win! (Don't crash! Stay on the track! Respect the safety car! ...)
3. Pit stop timing, tire choice

14 / 58 Agents in Reinforcement Learning
1. Sense state of environment
2. Take action that affects state
3. Have a goal relating to the state of the environment
[Figure: agent-environment loop]
"An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors." Russell and Norvig [RN03, pp. 32, 33]

15 / 58 Comparing to Supervised and Unsupervised Learning [Figure: supervised vs. unsupervised learning]
Supervised Learning (e.g. classification, regression, ranking): labeled examples from a supervisor, e.g. {(situation, action)_i}; learning task: generalize & extrapolate. Unlike RL, this is not learning from interaction (in RL we cannot simply sample such labeled examples).
Unsupervised Learning (e.g. clustering, segmentation, dimensionality reduction): unlabeled examples; learning task: find hidden structure. RL, in contrast, tries to maximize a reward.

16 / 58 Exploration and Exploitation [Figure: cow]
Maximize reward: exploit knowledge about rewarding actions
Discover reward-maximizing actions: explore new actions
Need for a tradeoff (not in supervised or unsupervised learning)
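
One standard way to trade off exploration and exploitation is ε-greedy action selection. The short Python sketch below only illustrates that idea under assumed names (q_values, epsilon); the lecture does not prescribe a particular scheme.

    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        """With probability epsilon pick a random action (explore),
        otherwise pick the action with the highest estimated value (exploit).

        q_values: dict mapping action -> estimated value (assumed representation).
        """
        if random.random() < epsilon:
            return random.choice(list(q_values))
        return max(q_values, key=q_values.get)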

17 / 58 Exploration and Exploitation in Car Racing [Figure: F1 car]
Exploitation: ? Exploration: ?

17 / 58 Exploration and Exploitation in Car Racing [Figure: F1 crash]
Exploitation: ? Exploration: ?
Both exploration and exploitation can fail, but we can learn from failing too

18 / 58 Challenges in Reinforcement Learning [Figures: F1 car, robot chess, robot soccer]
Interaction with the environment
Uncertainty of the situation (perception & state estimation)
Delayed consequences (requires foresight & planning)
Effects of actions cannot be fully predicted (stochastic environment)
Measuring the goal via an (immediate) reward signal

18 / 58 Challenges in Reinforcement Learning: Examples
Neurogammon, TD-Gammon: world champion level [TS89; Tes95]
Neural controller: learns within a small number of trials [Rie05; MLR12]
Cooperative RL agents [Rie+00; LR04]

19 / 58 Elements of Reinforcement Learning
Policy: defines behavior at a given time; similar to a stimulus-response rule
Reward signal: defines the goal / what is to be achieved (indirectly); the agent wants to maximize total reward over its running time; immediate feedback for a situation
Value function: what is good in the long run, e.g. total reward until the end of the interaction; a more foresighted evaluation of a situation; main problem: efficiently estimating values
Model: mimics the behavior of the environment; allows inference; model-based vs. model-free
Footnote 1: Reward and value don't have to agree

20 / 58 Examples [Figures: F1 car, robot chess, robot soccer]
What are a good policy, reward signal, value function, and model for each?

21 / 58 Optimization and Optimality
Agent tries to maximize reward (i.e. optimization)
Optimality might be impossible (theoretically or practically)

22 / 58 Table of Contents: Today's Goal, Reinforcement Learning Problems & Agents, Intuition, Terminology, History and Structure of the Field, Formalization, Informal Examples, Dynamic Programming, Monte Carlo, Summary, Material

23 / 58 Three Different Threads
1. Learning by trial and error
2. Optimal control
3. Temporal-difference methods
Joined in the 1980s to form modern Reinforcement Learning.

24 / 58 Learning by Trial and Error
Concepts from the study of animal behavior (since the 1850s)
Law of effect: the greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond between situation and response (Thorndike)
Reinforcement: strengthening of a pattern of behavior as a result of a stimulus (Pavlov 1927)
In computational intelligence: pleasure-pain system (Turing 1948); record connections between configurations based on feedback; electro-mechanical machines that find paths in mazes

25 / 58 Optimal Control
Designing a controller to minimize a measure of a dynamical system's behavior over time (since the 1950s)
Uses the Bellman equation and dynamic programming
Formalization: Markovian decision process
Dynamic programming: efficient tabular algorithms; regarded as the only feasible way to solve general stochastic optimal control problems
Curse of dimensionality (computation grows exponentially with the number of state variables)

26 / 58 Temporal-Difference Methods
Driven by the difference between temporally successive estimates of the same quantity (e.g. the probability of winning a game, the value of a state)
Unique to Reinforcement Learning
TD, Q-learning, SARSA
Eligibility traces: TD(λ), Q(λ), SARSA(λ)
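
To make "difference between temporally successive estimates" concrete, here is a minimal tabular TD(0) value update in Python. It is a sketch under assumed names (V as a dict of state values, step size alpha, discount γ); the lecture only introduces the idea informally at this point.

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
        """One TD(0) backup: move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
        td_error = r + gamma * V[s_next] - V[s]   # difference between successive estimates
        V[s] += alpha * td_error
        return td_error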

27 / 58 Structure of the Field of Reinforcement Learning (as in S&B, 2017; only half the truth)
Small problem: tabular methods; large problem: approximation methods
Problem formalizations: finite MDP, bandit problem
Value function (Bellman equations) vs. policy (policy gradient)
Method families: Dynamic Programming, Monte Carlo, temporal-difference
Further distinctions: on-policy vs. off-policy; eligibility traces

28 / 58 Classification of Problems and Environments (more on p. 405 in S&B, 2017)

Problem property                   DP      MC      TD      Approx.
Stationary [yes/no]
State space [cont./dis.]           dis.    dis.    dis.
Action space [cont./dis.]          dis.    dis.    dis.
State aliasing [yes/no]            no      no      no
Observable [yes/no]                yes     yes     yes
Feedback [instr./eval.]            eval.   eval.   eval.   eval.
Associative [yes/no]               yes     yes     yes     yes
Return [epis./cont./dis.]
Value at [action/state/afterstate]

(DP, MC, TD are tabular methods; the last column stands for approximative methods.)

29 / 58 Classification of Methods (more on p. 405 in S&B, 2017)

Method property                    DP      MC      TD      Approx.
Solution [approx./exact]           exact
Incremental                        +       -       +
Analysis                           +       +       -
Model-based                        yes     no      no
Learning [off-/on-policy]
Action selection / exploration     (on-policy or off-policy)
Bootstrapping
Uses actual experience
Backup type                        all     actual
Value function                     exact   exact   exact   approx.
Policy                             exact   exact   exact   approx. or exact

(DP, MC, TD are tabular methods; the last column stands for approximative methods.)

30 / 58 Backup Diagrams [Figure: backup diagrams, from S&B, 2017]

31 / 58 Table of Contents: Today's Goal, Reinforcement Learning Problems & Agents, Intuition, Terminology, History and Structure of the Field, Formalization, Informal Examples, Dynamic Programming, Monte Carlo, Summary, Material

32 / 58 Path Finding in a Grid-based Maze [Figure: maze with agent position A and goal G]
Four deterministic motions (up, down, left, right)
Free space, walls, and position are observable
Goal position (G)

33 / 58 Interaction Between Agent and Environment [Figure: agent-environment loop, S&B 2017, p. 48; maze with A and G]
Time steps t = 0, 1, 2, ...
Sense state S_t ∈ S
Execute action A_t ∈ A(S_t)
Receive reward R_{t+1} ∈ R ⊂ ℝ
Transition to state S_{t+1}
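
The sense-act-reward loop above can be written as a few lines of Python. The env and policy interfaces (env.reset() returning S_0, env.step(a) returning (S_{t+1}, R_{t+1}, done), policy(s) returning an action) are assumptions introduced only for illustration.

    def run_episode(env, policy, max_steps=100):
        """Run one agent-environment interaction episode and return the collected reward."""
        s = env.reset()                    # sense initial state S_0
        total_reward = 0.0
        for t in range(max_steps):
            a = policy(s)                  # choose action A_t in state S_t
            s, r, done = env.step(a)       # transition to S_{t+1}, receive R_{t+1}
            total_reward += r
            if done:
                break
        return total_reward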

34 / 58 Goals and Reward: Reward Hypothesis
"[...] goals and purposes [...] can be [...] thought of as the maximization of the expected [...] sum of a received scalar signal" (S&B 2017, p. 51)
Rewards come from outside the agent
The reward must model the task
[Figure: maze with A and G]

35 / 58 Collecting Reward Over Time: Returns
"...sum of a received scalar signal"
Reward signals: R_{t+1}, R_{t+2}, ... ∈ R
Return: G_t = Σ_{k=0}^{T} R_{t+1+k}
Episodic: T ∈ ℕ; continuing: T = ∞
Discounted return: G_t = Σ_{k=0}^{∞} γ^k R_{t+1+k}
Maximize E[G_t]
[Figure: maze with A and G]
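
As a small numeric illustration of the discounted return (my own example, not from the slides): with rewards (0, 0, 1) and γ = 0.9, G_t = 0 + 0.9·0 + 0.81·1 = 0.81.

    def discounted_return(rewards, gamma=0.9):
        """G_t = sum over k of gamma^k * R_{t+1+k} for a finite reward sequence."""
        return sum(gamma ** k * r for k, r in enumerate(rewards))

    print(discounted_return([0, 0, 1], gamma=0.9))   # 0.81 (up to floating point)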

36 / 58 Environment and States
"...whatever information is available to the agent"
State representation S
Markov property: p(S_{t+1}, R_{t+1} | S_0, A_0, R_1, ..., R_t, S_t, A_t) = p(S_{t+1}, R_{t+1} | S_t, A_t)
The state signal retains all relevant information
One-step dynamics can predict S_{t+1} and R_{t+1}
[Figure: maze with A and G]

37 / 58 Formal Framework: Markovian Decision Process
Environment: S, the state set; A, the action set; p(s', r | s, a), the one-step dynamics
Interaction: [Figure: agent-environment loop, S&B 2017, p. 48]
Agent: π_t(a | s) = Pr(A_t = a | S_t = s), a probabilistic policy
Learning: change π_t to collect reward over time
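
To make the triple S, A, p(s', r | s, a) concrete, here is one possible tabular encoding of a tiny MDP in Python. The variable names and the two-state example are illustrative assumptions, not part of the lecture.

    # P[(s, a)] lists (probability, next_state, reward) triples for the one-step dynamics.
    S = ["s0", "s1", "goal"]
    A = ["left", "right"]
    P = {
        ("s0", "left"):    [(1.0, "s0", 0.0)],
        ("s0", "right"):   [(1.0, "s1", 0.0)],
        ("s1", "left"):    [(1.0, "s0", 0.0)],
        ("s1", "right"):   [(1.0, "goal", 1.0)],
        ("goal", "left"):  [(1.0, "goal", 0.0)],   # goal modelled as absorbing
        ("goal", "right"): [(1.0, "goal", 0.0)],
    }

    def expected_reward(s, a):
        """E[R_{t+1} | S_t = s, A_t = a] under the dynamics table P."""
        return sum(p * r for p, _, r in P[(s, a)])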

38 / 58 Modelling the Task as an MDP
S = free cells
A = {up, down, left, right}
p(s', r | s, a): moves deterministically
R_{t+1} = 1 for the goal state, R_{t+1} = 0 for all other states
π_t deterministic
[Figure: maze with A and G]

39 / 58 Characterizing States and Actions by Value
State-value function for policy π: v_π(s) = E_π[G_t | S_t = s] = E_π[Σ_{k=0}^{∞} γ^k R_{t+1+k} | S_t = s]
Action-value function for policy π: q_π(s, a) = E_π[G_t | S_t = s, A_t = a] = E_π[Σ_{k=0}^{∞} γ^k R_{t+1+k} | S_t = s, A_t = a]
E_π[·] denotes the expectation when following π

40 / 58 Learning with Value Functions
Exact computation: Dynamic Programming
Estimation: Monte Carlo
Approximation: v_π and q_π parameterized
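
To illustrate the "Estimation: Monte Carlo" option, the following sketch computes a first-visit Monte Carlo estimate of v_π from sampled episodes. The episode format, a list of (S_t, R_{t+1}) pairs obtained by following π, is an assumption made only for this example.

    from collections import defaultdict

    def mc_state_values(episodes, gamma=0.9):
        """First-visit Monte Carlo estimate of v_pi from sampled episodes."""
        returns = defaultdict(list)
        for episode in episodes:
            # Returns computed backwards: G_t = R_{t+1} + gamma * G_{t+1}
            G = [0.0] * (len(episode) + 1)
            for t in range(len(episode) - 1, -1, -1):
                G[t] = episode[t][1] + gamma * G[t + 1]
            seen = set()
            for t, (s, _) in enumerate(episode):
                if s not in seen:                 # first visit of s in this episode
                    seen.add(s)
                    returns[s].append(G[t])
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}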

41 / 58 Recursive Relationship and Bellman Equation
Consistency between v_π(s) and v_π(s') for all s ∈ S with successor s' ∈ S:
v_π(s) = E_π[G_t | S_t = s]
       = E_π[Σ_{k=0}^{∞} γ^k R_{t+1+k} | S_t = s]
       = E_π[R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+2+k} | S_t = s]
       = Σ_{a,r,s'} π(a | s) p(s', r | s, a) (r + γ E_π[G_{t+1} | S_{t+1} = s'])
       = Σ_{a,r,s'} π(a | s) p(s', r | s, a) (r + γ v_π(s'))

42 / 58 Information Backup
Bellman equation for v_π: for all s ∈ S, v_π(s) = Σ_{a,r,s'} π(a | s) p(s', r | s, a) (r + γ v_π(s'))
Reinforcement Learning is (mostly) based on backup operations
[Figure: backup diagrams for v_π and q_π, S&B 2017, p. 64]
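
The backup above translates directly into iterative policy evaluation. The sketch below reuses the hypothetical dynamics table P[(s, a)] -> [(prob, next_state, reward), ...] from the earlier MDP example and a policy given as pi[s] -> {action: probability}; both interfaces are assumptions, not the lecture's notation.

    def policy_evaluation(S, A, P, pi, gamma=0.9, tol=1e-8):
        """Repeatedly apply the Bellman backup for v_pi until the values stop changing.

        Assumes P[(s, a)] is defined for every state-action pair
        (terminal states can be modelled as absorbing with zero reward).
        """
        V = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                v_new = sum(
                    pi[s].get(a, 0.0)
                    * sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                    for a in A
                )
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                return V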

43 / 58 Optimal Value Functions
Partial order on policies: π ≥ π' iff v_π(s) ≥ v_π'(s) for all s ∈ S
Optimal policies π*
Optimal state-value function: v*(s) = max_π v_π(s) for all s ∈ S
Optimal action-value function: q*(s, a) = max_π q_π(s, a) for all s ∈ S, a ∈ A(s)

44 / 58 Bellman Optimality Equation
For v*: v*(s) = max_{a ∈ A(s)} q_{π*}(s, a) = max_{a ∈ A(s)} Σ_{s',r} p(s', r | s, a) (r + γ v*(s'))
For q*: q*(s, a) = Σ_{s',r} p(s', r | s, a) (r + γ max_{a' ∈ A(s')} q*(s', a'))
[Figure: backup diagrams for v* and q*, S&B 2017, p. 69]

45 / 58 Learning with Optimal Value Functions
System of equations, one equation per state; solve explicitly
The greedy policy is optimal:
π*(s) = arg max_{a ∈ A(s)} Σ_{s',r} p(s', r | s, a) (r + γ v*(s'))
π*(s) = arg max_{a ∈ A(s)} q*(s, a)
Only a one-step-ahead search, based on v* or q*; with q*, no knowledge of the dynamics is needed
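
A greedy policy can be read off a learned action-value table directly, as in the second expression above; a minimal sketch, assuming Q is a dict mapping (state, action) pairs to values:

    def greedy_policy(Q, states, actions):
        """pi*(s) = argmax_a Q[(s, a)], computed state by state from a tabular Q."""
        return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}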

46 / 58 Table of Contents: Today's Goal, Reinforcement Learning Problems & Agents, Intuition, Terminology, History and Structure of the Field, Formalization, Informal Examples, Dynamic Programming, Monte Carlo, Summary, Material

47 / 58 Modelling the Task as an MDP
Simplification: deterministic
S = free cells
A = {up, down, left, right}
p(s', r | s, a): moves deterministically
r = 1 for reaching the goal state, r = −1 for all other transitions
π_t deterministic
[Figure: maze with A and G]

48 / 58 Dynamic Programming
Model-based: p(s' | s, a) known, r(s, a, s') known
Iteratively approximate v*; get π* from v*
Table representation for v
One-step backup for each state
Value Iteration algorithm (for a convergence proof see a later session):
for k = 1, 2, ... do
    foreach s ∈ S do
        V_k(s) ← max_{a ∈ A} { r(s, a, s') + γ V_{k−1}(s') }
    end
end
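
The deterministic backup above maps almost line by line onto code. The following sketch assumes a hypothetical step(s, a) -> (next_state, reward) function encoding the deterministic maze dynamics; it is an illustration under that assumption, not the lecture's implementation.

    def value_iteration(states, actions, step, gamma=0.9, sweeps=100):
        """Tabular value iteration for a deterministic MDP.

        step(s, a) -> (next_state, reward) is an assumed interface.
        """
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            V_new = {}
            for s in states:
                # One-step backup: best one-step lookahead under the current estimate V_{k-1}.
                V_new[s] = max(r + gamma * V[s_next]
                               for s_next, r in (step(s, a) for a in actions))
            V = V_new
        return V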

49 / 58 Map [Figure: gridworld map, from S&B, 2017]

50 / 58 Value Backup [Figure sequence: successive value backups on the gridworld]

51 / 58 Reward Signal Design

52 / 58 Monte Carlo
Model-free: p(s' | s, a) not known, r(s, a, s') not known
Approximate q*... from actual episodes, e.g. transitions (s, a, r, s')
Table representation for q
One-step backup for each trace
Q-Learning algorithm (for a convergence proof see a later session):
until convergence do
    execute some episode;
    for each transition: Q(s, a) ← r + γ max_{a' ∈ A(s')} Q(s', a')
end
Do several episodes; exploration and exploitation
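
As a runnable counterpart to the update rule above, here is a minimal tabular Q-learning sketch in Python. The env interface (reset() -> state, step(a) -> (next_state, reward, done)) and the ε-greedy exploration are assumptions added for illustration; the slide's deterministic update has no learning rate.

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes=500, gamma=0.9, epsilon=0.1):
        """Tabular Q-learning with the deterministic one-step backup from the slide."""
        Q = defaultdict(float)                      # Q[(s, a)], initialized to 0
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                if random.random() < epsilon:       # explore
                    a = random.choice(actions)
                else:                               # exploit
                    a = max(actions, key=lambda act: Q[(s, act)])
                s_next, r, done = env.step(a)
                # Q(s, a) <- r + gamma * max_a' Q(s', a')
                Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
                s = s_next
        return Q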

53 / 58 Map [Figure: gridworld map, from S&B, 2017]

54 / 58 Table of Contents: Today's Goal, Reinforcement Learning Problems & Agents, Intuition, Terminology, History and Structure of the Field, Formalization, Informal Examples, Dynamic Programming, Monte Carlo, Summary, Material

55 / 58 Summary
Reinforcement learning: a type of learning problem, not a method
Exploration and exploitation
Important characteristics: interaction; the reward describes the goal
Challenges: delayed consequences, credit assignment
Formalization: MDP, value functions, Bellman (optimality) equations

56 / 58 Table of Contents: Today's Goal, Reinforcement Learning Problems & Agents, Intuition, Terminology, History and Structure of the Field, Formalization, Informal Examples, Dynamic Programming, Monte Carlo, Summary, Material

57 / 58 Material I
Martin Lauer and Martin Riedmiller. "Reinforcement learning for stochastic cooperative multi-agent systems." In: Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, Volume 3. IEEE Computer Society, 2004, pp. 1516–1517.
Jan Mattner, Sascha Lange, and Martin Riedmiller. "Learn to swing up and balance a real pole based on raw visual input data." In: International Conference on Neural Information Processing. Springer, 2012, pp. 126–133.
Martin Riedmiller et al. "Karlsruhe Brainstormers - a reinforcement learning approach to robotic soccer." In: Robot Soccer World Cup. Springer, 2000, pp. 367–372.

58 / 58 Material II
Martin Riedmiller. "Neural reinforcement learning to swing-up and balance a real pole." In: 2005 IEEE International Conference on Systems, Man and Cybernetics, Vol. 4. IEEE, 2005, pp. 3191–3196.
Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 2nd ed. Prentice Hall, 2003.
Gerald Tesauro. "Temporal Difference Learning and TD-Gammon." In: Communications of the ACM 38.3 (Mar. 1995), pp. 58–68. ISSN: 0001-0782. DOI: 10.1145/203330.203343. URL: http://doi.acm.org/10.1145/203330.203343.
Gerald Tesauro and Terrence J. Sejnowski. "A parallel network that learns to play backgammon." In: Artificial Intelligence 39.3 (1989), pp. 357–390.